[Corpora-List] Call for Participation - Second VarDial Evaluation Campaign co-located with COLING 2018

Zampieri, Marcos M.Zampieri at wlv.ac.uk
Mon Feb 5 17:08:37 CET 2018

Call for Participation - Second VarDial Evaluation Campaign

Within the scope of the VarDial workshop, co-located with COLING 2018, we are organizing an evaluation campaign on similar languages, varieties and dialects with multiple shared tasks.

URL: http://alt.qcri.org/vardial2018/index.php?id=campaign

We are organizing five shared tasks this year:

- (ADI) Arabic Dialect Identification: The third edition of the ADI task will address the multi-dialectal challenge of spoken Arabic in the broadcast news domain. Previously, we shared acoustic features and lexical word sequences extracted from large-vocabulary continuous speech recognition (LVCSR). This year, we will add phonetic features, enabling researchers to use both prosodic and phonetic features, which are helpful for distinguishing between dialects.

- (GDI) German Dialect Identification: After a successful first edition of the (Swiss) German Dialect Identification task at VarDial 2017, we are organizing a second iteration of this task. We will again focus on four Swiss German dialect areas (Basel, Bern, Lucerne, Zurich), with the addition of a fifth area subject to data availability. We will provide updated and expanded speech transcripts for all dialect areas, and also release corresponding acoustic data as well as (predicted) part-of-speech tags.

- (MTT) Morphosyntactic Tagging of Tweets: This task focuses on morphosyntactic annotation (900+ labels) of non-canonical Twitter varieties of three South-Slavic languages -- Slovene, Croatian, and Serbian. Task participants will be provided with large manually annotated and raw canonical datasets, as well as small manually annotated Twitter datasets.

- (DFS) Discriminating between Dutch and Flemish in Subtitles: The task focuses on determining whether a text is written in the Netherlandic or the Flemish variant of the Dutch language. For this task, participants are provided with a dataset consisting of almost 100,000 professionally produced subtitles for movies, documentaries and television shows.

- (ILI) Indo-Aryan Language Identification: This task focuses on identifying five closely related languages of the Indo-Aryan language family -- Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. These languages form part of a continuum starting from Western Uttar Pradesh (Hindi and Braj Bhasha) to Eastern Uttar Pradesh (Awadhi and Bhojpuri) and the neighbouring Eastern state of Bihar (Bhojpuri and Magahi). For this task, participants will be provided with a dataset of approximately 15,000 sentences in each language, mainly from the domain of literature, published on the web as well as in print.

To participate and receive the training data, please fill in the registration form available on the workshop website. The test sets will be released on March 12, 2018.

Best,
Marcos
on behalf of the VarDial organizers

-----
Dr. Marcos Zampieri
Research Fellow
Research Group in Computational Linguistics
University of Wolverhampton, UK
http://pers-www.wlv.ac.uk/~u22984/
