[Corpora-List] The Annotated Gumar Corpus - Public Release

Salam Khalifa salam.khalifa at gmail.com
Tue Aug 4 10:58:57 CEST 2020

We at the CAMeL Lab at NYU Abu Dhabi are happy to announce the release of the Annotated Gumar Corpus, a manually annotated corpus of Gulf Arabic, specifically Emirati Arabic. The corpus consists of 200,000 words selected from eight novels in the Gumar Corpus. Each word is annotated in context for tokenization, part of speech, lemmatization, spelling adjustment, and English gloss, along with sentence-level dialect identification.

The release includes the data presented in Khalifa et al. (2018): Khalifa, Salam, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim, and Meera Al Kaabi. A Morphologically Annotated Corpus of Emirati Arabic. https://www.aclweb.org/anthology/L18-1607.pdf

The data was also used to develop the first morphological disambiguation system for Gulf Arabic as presented and discussed in Khalifa et al. (2020): Khalifa, Salam, Nasser Zalmout, and Nizar Habash. Morphological Analysis and Disambiguation for Gulf Arabic: The Interplay between Resources and Methods. https://www.aclweb.org/anthology/2020.lrec-1.480.pdf

The Annotated Gumar Corpus is available to download from: https://camel.abudhabi.nyu.edu/annotated-gumar-corpus/

Best,
*Salam Khalifa*
Research Assistant @ CAMeL <http://www.camel-lab.com>
NYU Abu Dhabi
