[Corpora-List] Training Data Released - Second VarDial Evaluation Campaign

Zampieri, Marcos M.Zampieri at wlv.ac.uk
Tue Mar 13 15:53:15 CET 2018

Training Data Released - Second VarDial Evaluation Campaign

Within the scope of the VarDial workshop, co-located with COLING 2018, we are organizing an evaluation campaign on similar languages, varieties and dialects with multiple shared tasks.

URL: http://alt.qcri.org/vardial2018/index.php?id=campaign

We are organizing five shared tasks this year:

- (ADI) Arabic Dialect Identification: The third edition of the ADI task will address the multi-dialectal challenge in spoken Arabic in the broadcast news domain. Previously, we shared acoustic features and lexical word sequences extracted from large-vocabulary continuous speech recognition (LVCSR). This year, we will add phonetic features, enabling researchers to use both prosodic and phonetic features, which help distinguish between dialects.

- (GDI) German Dialect Identification: After a successful first edition of the (Swiss) German Dialect Identification task in 2017, we are organizing a second iteration of this task. We provide updated data on the same Swiss German dialect areas as last year (Basel, Bern, Lucerne, Zurich) and add a fifth "surprise dialect" for which no training data is made available.

- (MTT) Morphosyntactic Tagging of Tweets: This task focuses on morphosyntactic annotation (900+ labels) of non-canonical Twitter varieties of three South-Slavic languages -- Slovene, Croatian, and Serbian. Task participants will be provided with large manually annotated and raw canonical datasets, as well as small manually annotated Twitter datasets.

- (DFS) Discriminating between Dutch and Flemish in Subtitles: The task focuses on determining whether a text is written in the Netherlandic or the Flemish variant of the Dutch language. For this task, participants are provided with a dataset consisting of almost 100,000 professionally produced subtitles for movies, documentaries and television shows.

- (ILI) Indo-Aryan Language Identification: This task focuses on identifying five closely related languages of the Indo-Aryan language family – Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. These languages form part of a continuum stretching from Western Uttar Pradesh (Hindi and Braj Bhasha) to Eastern Uttar Pradesh (Awadhi and Bhojpuri) and the neighbouring Eastern state of Bihar (Bhojpuri and Magahi). For this task, participants will be provided with a dataset of approximately 15,000 sentences in each language, mainly from the domain of literature, published on the web as well as in print.

To participate and to receive the training data (released March 12), please fill out the registration form available at the workshop website.

The VarDial workshop will take place in August 2018 in Santa Fe, USA.

Best,
Marcos
on behalf of the VarDial organizers

-----
Dr. Marcos Zampieri
Research Fellow
Research Group in Computational Linguistics
University of Wolverhampton, UK
http://pers-www.wlv.ac.uk/~u22984/
