[Corpora-List] Call for Participation: TOOL CONTEST ON POS TAGGING FOR CODE-MIXED INDIAN SOCIAL MEDIA (FACEBOOK, TWITTER, AND WHATSAPP) TEXT @ ICON 2016

Amitava Das amitava.santu at gmail.com
Thu Jul 7 16:14:31 CEST 2016


TOOL CONTEST ON POS TAGGING FOR CODE-MIXED INDIAN SOCIAL MEDIA (FACEBOOK, TWITTER, AND WHATSAPP) TEXT @ ICON 2016

==================================================================================================================

http://amitavadas.com/Code-Mixing.html

http://ltrc.iiit.ac.in/icon2016/

The evolution of social media texts such as blogs, micro-blogs (e.g., Twitter), WhatsApp, and chats (e.g., Facebook messages) has created many new opportunities for information access and language technology, but also many new challenges, making it one of the prime present-day research areas. Non-English speakers, especially Indians, do not always use Unicode scripts when writing in Indian languages (ILs) on social media. Instead, they use phonetic typing, roman script, or transliteration, frequently insert English words or phrases through code-mixing and anglicisms (see Example 1 below), and often mix multiple languages to express their thoughts. While English is still the principal language of social media communication, there is a growing need to develop technologies for other languages, including Indian languages. India is home to several hundred languages, and this language diversity, together with dialect variation, encourages frequent code-mixing. Indians are thus multilingual by adaptation and necessity, and frequently switch and mix languages in social media contexts, which poses additional difficulties for the automatic processing of Indian social media text. Part-of-speech (POS) tagging is an essential prerequisite for most NLP applications. This year we will continue last year's POS tagging shared task on three widely spoken Indian languages (Hindi, Bengali, and Telugu), mixed with English.

Example 1: ICON 2016 Varanasi me hold hoga! Great chance to see the pracheen nagari!

THE CONTEST

Participants will be provided with training, development, and test data and will report the performance of their POS tagging systems. English-Hindi, English-Bengali, and English-Telugu language mixing will be explored. The datasets may include additional information such as the language of each word. Performance will be measured in terms of Precision, Recall, and F-measure. Shortlisted teams will present their techniques and results in a special session at ICON 2016.
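
For reference, here is a minimal sketch of how per-tag Precision, Recall, and F-measure could be computed at the token level. The official scoring script is not described in this announcement, so the exact averaging scheme below is an assumption:

    from collections import Counter

    def evaluate(gold, pred):
        """Per-tag precision/recall/F1 over aligned token sequences.

        gold, pred: lists of POS tag strings, one per token, same length.
        Returns {tag: (precision, recall, f1)}.
        """
        assert len(gold) == len(pred)
        tp, fp, fn = Counter(), Counter(), Counter()
        for g, p in zip(gold, pred):
            if g == p:
                tp[g] += 1          # correct tag for this token
            else:
                fp[p] += 1          # predicted tag was wrong
                fn[g] += 1          # gold tag was missed
        scores = {}
        for tag in set(gold) | set(pred):
            prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
            rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
            f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
            scores[tag] = (prec, rec, f1)
        return scores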

The contest will have three prizes:

FIRST PRIZE: Rs.10,000/-

SECOND PRIZE: Rs.7,500/-

THIRD PRIZE: Rs.5,000/-

WHAT'S NEW THIS YEAR

We are releasing code-mixed WhatsApp data for 3 language pairs: English-Hindi, English-Bengali, and English-Telugu. This is possibly the first time NLP issues in WhatsApp messages are being addressed. WhatsApp messages are typically much shorter than Facebook and Twitter messages, and are therefore more challenging. We hope it will be exciting!

THE TASK

The contest task is to predict POS tags at the word level; language tags (en, hi/bn/te, univ {symbols, @-mentions, hashtags}, mixed {word-level mixing, like "jugading"}, acro {lol, rofl, etc.}, ne, undef) will be provided at the word level. There will be two tracks: a fine-grained tagset and a coarse-grained tagset (the Google universal tagset: http://www.dipanjandas.com/files/lrec.pdf). The fine-grained tagset and its mapping to the coarse-grained tagset are described in our RANLP paper (http://amitavadas.com/Pub/AJ_RANLP_2015.pdf).
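
Since this announcement does not fix the exact file format, the following sketch assumes each token arrives as a (word, language-tag, POS-tag) triple and shows a simple most-frequent-tag baseline that exploits the provided word-level language tags. All names and the data layout here are illustrative assumptions, not part of the released data:

    from collections import Counter, defaultdict

    def train_baseline(sents):
        """sents: list of sentences; each sentence is a list of
        (word, lang, pos) triples, e.g. ("hoga", "hi", "VERB")."""
        word_tags = defaultdict(Counter)   # (word, lang) -> POS tag counts
        lang_tags = defaultdict(Counter)   # lang -> POS tag counts
        global_tags = Counter()            # overall POS tag counts
        for sent in sents:
            for word, lang, pos in sent:
                word_tags[(word.lower(), lang)][pos] += 1
                lang_tags[lang][pos] += 1
                global_tags[pos] += 1

        def tag(word, lang):
            seen = word_tags.get((word.lower(), lang))
            if seen:
                return seen.most_common(1)[0][0]
            if lang_tags.get(lang):
                # unseen word: back off to the most frequent tag of its language
                return lang_tags[lang].most_common(1)[0][0]
            # unseen language: back off to the globally most frequent tag
            return global_tags.most_common(1)[0][0]

        return tag

For instance, tagger = train_baseline([[("hold", "en", "VERB"), ("hoga", "hi", "VERB")]]) returns a function such that tagger("hoga", "hi") yields "VERB".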

Each team may submit up to 4 runs: one constrained and one unconstrained run for each of the two tracks (fine-grained and coarse-grained).

Constrained: The participating team may use only our corpus for training. No external resources are allowed.

Unconstrained: The participating team may use any external resources (existing POS taggers, NER systems, parsers, or any additional data) to train their system. These resources must be mentioned explicitly in the task report.

WINNER SELECTION

The team performing best across all the language pairs using only our data (constrained) will be the winner. All unconstrained submissions will be used for academic discussion during the session.

** Note: teams may use the ICON 2015 data as an additional resource, but such submissions will be considered unconstrained.

DATA

Training data for Twitter (1K), Facebook (1K), and WhatsApp (1K) will be released for all 3 language pairs: English-Hindi, English-Bengali, and English-Telugu. Although code-mixing is a natural practice for bi- and multilinguals, the actual distribution of code-mixing in any social media corpus is an important question. We have observed that monolingual English and romanized Indian language (IL) messages are equally prevalent in social media. For this contest we discarded almost all monolingual English messages, since the research problems of English social media text have already been discussed extensively in other efforts and forums. Here we concentrate only on code-mixed En-IL and monolingual IL messages.

When two languages blend, another important question arises: which language is mixed into which? To keep our data balanced, we maintain an equal distribution of utterances where English is mixed into ILs and where ILs are mixed into English.

Although our corpus is mostly bilingual, there are utterances with tri- and quadrilingual mixing. For example, the English-Bengali corpus contains a significant number of Hindi words, while the English-Telugu data contains significant Tamil and Hindi mixing.

Thanks,
Amitava

----------------------------------------------------------------
DR. AMITAVA DAS
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY (IIITS)
SRI CITY, AP, INDIA
Web Page: http://www.amitavadas.com/
----------------------------------------------------------------


