[Corpora-List] SemEval 2020 Task 9: Sentiment Analysis for Code-Mixed social media text

Amitava Das amitava.santu at gmail.com
Mon Sep 16 10:52:46 CEST 2019

*http://www.amitavadas.com/SentiMix.html*




Mixing languages, also known as code-mixing, is the norm in multilingual societies. Multilingual people who are non-native English speakers tend to code-mix using English-based phonetic typing and the insertion of anglicisms in their main language. In addition to mixing languages at the sentence level, it is fairly common to find code-mixing at the word level. This linguistic phenomenon poses a great challenge to conventional NLP systems, which currently rely on monolingual resources to handle the combination of multiple languages. The objective of this proposal is to draw the attention of the research community to the task of sentiment analysis in code-mixed social media text. Specifically, we focus on the combination of English with Spanish (Spanglish) and Hindi (Hinglish), which are the 3rd and 4th most spoken languages in the world, respectively.

*Hinglish and Spanglish - the Modern Urban Languages*

The evolution of social media texts such as blogs, micro-blogs (e.g., Twitter), and chats (e.g., WhatsApp and Facebook messages) has created many new opportunities for information access and language technology, but it has also posed many new challenges, making it one of the current prime research areas. Although current language technologies are primarily built for English, non-native English speakers combine English and other languages when they use social media. In fact, statistics show that half of the messages on Twitter are in a language other than English (Schroeder, 2010). This evidence suggests that other languages, as well as multilinguality and code-mixing, need to be considered by the NLP community.

Code-mixing poses several unseen difficulties for NLP tasks such as word-level language identification, part-of-speech tagging, dependency parsing, machine translation, and semantic processing. Conventional NLP systems rely heavily on monolingual resources to address code-mixed text, which limits their ability to handle issues such as English-based phonetic typing and word-level code-mixing. The next two phrases are examples of code-mixing in Spanglish and Hinglish. In the Spanglish example, in addition to the code-mixing at the sentence level, the word *pushes* conjugates the English word *push* according to the grammar rules of Spanish, which shows that code-mixing can also happen at the word level. In the Hinglish example, only one English word, *enjoy*, has been used. More noticeably, the Hindi words are written with English phonetic typing instead of Devanagari script, which is a popular practice in India.

*No me pushes, please* *Eng. Trans.:* Do not push me, please

*Aye aur enjoy kare* *Eng. Trans.:* Come and enjoy

Additionally, code-mixing frequently occurs in informal settings like social media platforms, which brings more challenges such as flexible grammar, creative spelling, arbitrary punctuation, slang, genre-specific terminology and abbreviations, among others. This whole scenario opens up new research lines where the focus goes beyond simply combining monolingual resources to address the multilingual code-mixing phenomenon in social media environments.

Naturally, code-mixing is more common in geographical regions with a high percentage of bi- or multilingual speakers, such as Texas and California in the US, Hong Kong and Macao in China, many European and African countries, and the countries of South-East Asia. Multilinguality and code-mixing are also very common in India. Here we propose a sentiment analysis shared task on code-mixed social media text for the 3rd and 4th most widely spoken languages, i.e., Spanish and Hindi, mixed with English.

Although code-mixing has received considerable attention recently, properly annotated data is still scarce. In this shared task, we will be releasing 20K annotated tweets with word-level language labels and tweet-level sentiment labels. We believe that the wide interest in the task of sentiment analysis can attract the attention of the NLP community to the code-mixing phenomenon.

*The SentiMix task - A summary*

The task is to predict the sentiment of a given code-mixed tweet. The sentiment labels are *positive*, *negative*, or *neutral*, and the code-mixed languages will be English-Hindi and English-Spanish. Besides the sentiment labels, we will also provide the language labels at the word level. The word-level language tags are *en* (English), *spa* (Spanish), *hi* (Hindi), *mixed*, and *univ* (e.g., symbols, @ mentions, hashtags). Table 1 shows examples of annotated tweets. Participants will be provided with training, development, and test data to report the performance of their sentiment analysis systems. Performance will be measured in terms of Precision, Recall, and F-measure.
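To make the annotation scheme concrete, here is a minimal Python sketch of how one annotated tweet could be represented in memory. The class name `AnnotatedTweet`, its fields, and the helper method are hypothetical and do not describe the released file format. The tokens come from the Hinglish example above; the sentiment label attached to it is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Label inventories as listed in the task summary.
SENTIMENTS = {"positive", "negative", "neutral"}
LANG_TAGS = {"en", "spa", "hi", "mixed", "univ"}


@dataclass
class AnnotatedTweet:
    """Hypothetical container: word-level language tags plus one sentiment label."""

    tokens: List[Tuple[str, str]]  # (word, language tag) pairs
    sentiment: str

    def __post_init__(self) -> None:
        # Reject labels that are outside the inventories named in the task.
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"unknown sentiment label: {self.sentiment!r}")
        for word, tag in self.tokens:
            if tag not in LANG_TAGS:
                raise ValueError(f"unknown language tag {tag!r} for word {word!r}")

    def language_counts(self) -> Dict[str, int]:
        """Count how many tokens carry each language tag."""
        counts: Dict[str, int] = {}
        for _, tag in self.tokens:
            counts[tag] = counts.get(tag, 0) + 1
        return counts


# Hinglish example from the text; the "positive" label is an illustrative assumption.
example = AnnotatedTweet(
    tokens=[("Aye", "hi"), ("aur", "hi"), ("enjoy", "en"), ("kare", "hi")],
    sentiment="positive",
)
print(example.language_counts())
```

Validating labels at construction time is one simple way to catch malformed annotations early; a real loader would depend on the format in which the data is actually distributed.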

*Data & Resources*

The organizing team has collected and annotated the corpus. The source of the corpus is social media, specifically Twitter. To collect code-mixed data, an extensive list of Twitter handles and pages exhibiting a lot of code-mixing was prepared. For both language pairs, Spanglish and Hinglish, a corpus of 20,000 annotated tweets will be released. The data is annotated with tweet-level sentiment and word-level language.

To ensure the quality of the annotation, the data is annotated semi-automatically. A baseline word-level language identifier and a tweet-level sentiment analyzer were used to obtain initial annotations, which were then crowdsourced to obtain the correct annotations. Finally, a manual quality check was carried out for each crowdsource annotator, and annotations falling below a specified quality threshold were discarded. Each tweet has been annotated by at least two crowdsource annotators and further manually checked. Only those tweets for which the inter-annotator agreement (*kappa score*) is above 0.8 are chosen.
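As an illustration of the agreement check, the following Python sketch computes Cohen's kappa for two annotators. The text only says "kappa score", so the choice of Cohen's variant, the function name `cohen_kappa`, and the toy labels are all assumptions; the organizers' exact procedure is not specified here.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
annotator_1 = ["positive", "positive", "negative", "neutral"]
annotator_2 = ["positive", "positive", "negative", "neutral"]
kappa = cohen_kappa(annotator_1, annotator_2)
keep = kappa > 0.8  # threshold stated in the text
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over a plain percentage when the label distribution is skewed.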

*Pilot Task*

Code-mixing has received significant research attention in the last few years. There have been three successful editions of the workshop on Computational Approaches to Linguistic Code-Switching (CALCS). At EMNLP 2014, the first CALCS workshop (Solorio et al., 2014) received a total of 18 regular workshop submissions, of which 8 were accepted. Additionally, 7 teams participated in the shared task on language identification. At EMNLP 2016, the second CALCS workshop (Molina et al., 2016) received 19 regular workshop submissions, of which 17 were accepted, and 9 teams participated in the shared task. At ACL 2018, the third CALCS workshop (Aguilar et al., 2018) received 19 regular workshop submissions, of which 11 were accepted. The shared task on named entity recognition attracted 9 teams. Thamar, one of the proposers of the current task, is the organizer of the CALCS workshop series.

Four successful editions of the Mixed Script Information Retrieval task (SahaRoy et al., 2013; Choudhury et al., 2014; Sequiera et al., 2015; Banerjee et al., 2016) have been organized with the Forum for Information Retrieval Evaluation (FIRE). The tasks addressed a range of issues, including word-level language identification, IR for code-mixed languages, and question answering for code-mixed languages. Amitava, one of the proposers of the task, was one of the organizers. In each of these years, 10+ teams participated in the task series.

Two successful shared tasks on POS tagging for code-mixed languages have been organized with the International Conference on Natural Language Processing (ICON) in 2015 and 2016 (Das, 2015-2016). Amitava, one of the proposers of the task, was the organizer. Altogether, 5 and 7 teams participated in 2015 and 2016, respectively.

Despite these successful workshops and events, we feel that more effort is needed, and SemEval is the ideal forum for organizing a sentiment analysis task on code-mixed language.

*Expected Impacts*

Although code-mixing has received considerable attention in recent years, properly annotated data is still scarce. In this shared task, we will be releasing 20K annotated tweets with word-level language marking and tweet-level sentiment tags. Although the task will mainly focus on the sentiment analysis problem, the data will serve anyone in the NLP community who is interested in the code-mixing problem for these two language pairs.

*Evaluation Ranking and the Baseline*

The metric for evaluating the participating systems will be as follows. We will use F1 averaged across the positive, negative, and neutral classes. The final ranking will be based on this average F1 score.

However, for further theoretical discussion, we will also release macro-averaged recall (recall averaged across the three classes), since it has better theoretical properties than F1 (Esuli and Sebastiani, 2015) and provides better consistency.
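The two measures described above can be sketched in plain Python as follows. This is an illustrative reading of "F1 averaged across the three classes" and "recall averaged across the three classes", not the official scorer; the function names, the zero-division convention, and the toy labels are assumptions.

```python
from typing import Sequence, Tuple

LABELS = ("positive", "negative", "neutral")


def per_class_scores(
    gold: Sequence[str], predicted: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    # Assumed convention: a score with a zero denominator counts as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_scores(
    gold: Sequence[str], predicted: Sequence[str], labels: Sequence[str] = LABELS
) -> Tuple[float, float]:
    """Return (macro-averaged F1, macro-averaged recall) over the given classes."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scores = [per_class_scores(gold, predicted, label) for label in labels]
    macro_f1 = sum(s[2] for s in scores) / len(labels)
    macro_recall = sum(s[1] for s in scores) / len(labels)
    return macro_f1, macro_recall


# Toy labels, invented for illustration only.
gold = ["positive", "negative", "neutral", "positive"]
predicted = ["positive", "negative", "positive", "positive"]
macro_f1, macro_recall = macro_scores(gold, predicted)
```

Because every class contributes equally to a macro average, a system that ignores the neutral class is penalized even if neutral tweets are rare.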

Each participating team will initially have access to the training data only. Later, the unlabelled test data will be released. After SemEval-2020, the labels for the test data will be released as well. We will ask the participants to submit their predictions in a specified format (within 24 hours), and the organizers will calculate the results for each participant. We will make no distinction between constrained and unconstrained systems, but the participants will be asked to report what additional resources they have used for each submitted run.

Thanks,
Amitava
