[Corpora-List] SIGTYP 2020 Shared Task on the Prediction of Typological Features

Ekaterina Vylomova evylomova at gmail.com
Wed Apr 8 06:40:25 CEST 2020

In 2020, SIGTYP is offering a shared task on the Prediction of Typological Features. The shared task encompasses nearly 2,000 languages, with typological features taken from the World Atlas of Language Structures (WALS; Dryer and Haspelmath 2013).

To participate in the shared task, you will build a system that can predict typological properties of languages, given a handful of observed features. Training examples and development examples have already been provided (see link below). All submitted systems will be compared on a held-out test set.

Moreover, you will be invited to describe your system in a system paper for the SIGTYP workshop proceedings. The task organizers will write an overview paper that describes the task and summarises the different approaches taken, and their results.

*Important Links*

- Download Train and Dev data: https://github.com/sigtyp/ST2020/tree/master/data
- Register for the Task: https://sigtyp.github.io/st2020-reg.html

*Important Dates*

- Training data Release: 26 March 2020
- Test data Release: 20 June 2020
- Submissions Due: 1 July 2020
- Writeup Due: 1 August 2020


The typological features in WALS represent one approach to the categorization of the languages of the world according to their linguistic properties, e.g. in terms of their syntax, morphology, and phonology. One example of such a typological feature is basic word order. For instance, English is best described as a subject-verb-object (SVO) language, whereas Japanese is best described as a subject-object-verb (SOV) language.

One major issue with WALS, however, is that it is both sparse and skewed in terms of language-feature annotations. It is sparse in the sense that most languages only have annotations for a handful of features, and skewed in the sense that a few features have much wider coverage than others. Luckily, such features often correlate with one another, which allows for prediction of those features from others. For instance, languages where the verb precedes the object tend to have prepositions, e.g. Norwegian, whereas languages where the object precedes the verb tend to have postpositions, e.g. Japanese.
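The correlation described above can be illustrated with a minimal sketch. It assumes a toy, hand-written table of four languages and uses invented feature names (`verb_object`, `adposition`); it is not the shared-task data format, and it is only a frequency baseline, not a recommended system.

```python
from collections import Counter, defaultdict

# Toy, hand-written examples (NOT the shared-task data format).
# Feature names and value labels are illustrative assumptions.
languages = {
    "English":   {"verb_object": "VO", "adposition": "prepositions"},
    "Norwegian": {"verb_object": "VO", "adposition": "prepositions"},
    "Japanese":  {"verb_object": "OV", "adposition": "postpositions"},
    "Turkish":   {"verb_object": "OV", "adposition": "postpositions"},
}


def fit_conditional_counts(data, given, target):
    """Count how often each target value co-occurs with each given value."""
    counts = defaultdict(Counter)
    for feats in data.values():
        # Skip languages that lack either annotation (WALS is sparse).
        if given in feats and target in feats:
            counts[feats[given]][feats[target]] += 1
    return counts


def predict(counts, given_value):
    """Return the most frequent target value, or None if given_value is unseen."""
    if given_value not in counts:
        return None
    return counts[given_value].most_common(1)[0][0]


counts = fit_conditional_counts(languages, "verb_object", "adposition")
print(predict(counts, "OV"))  # postpositions
print(predict(counts, "VO"))  # prepositions
```

A real system would need to combine many observed features and cope with languages for which the correlated feature is itself missing.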

Although there is a significant amount of previous work dealing with versions of this task (Daumé III and Campbell 2007; Bjerva et al. 2019; Ponti et al. 2019), important design choices have frequently been ignored. Some papers controlled for genetic relationships between training and evaluation languages, but little to no work has considered controlling for geographical proximity.
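To make the two controls concrete, here is one way a train/test split could account for both genetic relationship and geographical proximity. This is a sketch under stated assumptions, not the organizers' procedure: the record schema (`name`, `family`, `lat`, `lon`), the placeholder languages, and the distance threshold are all hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def controlled_split(languages, test_families, min_km):
    """Hold out whole families, then drop training languages near any test language."""
    test = [l for l in languages if l["family"] in test_families]
    train = []
    for lang in languages:
        if lang["family"] in test_families:
            continue  # genetic control: no family is shared across the split
        near = any(
            haversine_km(lang["lat"], lang["lon"], t["lat"], t["lon"]) < min_km
            for t in test
        )
        if not near:  # geographic control: keep only sufficiently distant languages
            train.append(lang)
    return train, test


# Placeholder records with made-up coordinates (hypothetical, for illustration only).
langs = [
    {"name": "lang_a", "family": "F1", "lat": 0.0, "lon": 0.0},
    {"name": "lang_b", "family": "F2", "lat": 0.0, "lon": 1.0},
    {"name": "lang_c", "family": "F2", "lat": 0.0, "lon": 90.0},
]
train, test = controlled_split(langs, test_families={"F1"}, min_km=500)
print([l["name"] for l in train], [l["name"] for l in test])
```

With these placeholder values, `lang_b` is excluded from training because it lies roughly 111 km from the held-out `lang_a`, even though it belongs to a different family.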

The shared task will consist of two settings (subtasks):

- Constrained: only the provided training data can be employed.
- Unconstrained: training data can be extended with any external source of information (e.g. pre-trained embeddings, raw texts, etc.).

Read More: https://sigtyp.github.io/st2020.html
