[Corpora-List] 2nd CfP: Grammar Data Mining (GDM): Extracting Linguistic Features From Grammatical Descriptions, September 5-6, 2019 - Varna, Bulgaria

Harald Hammarström harald at bombo.se
Wed Jun 19 17:54:52 CEST 2019


2nd Call for papers

Grammar Data Mining (GDM): Extracting Linguistic Features From Grammatical Descriptions

September 5-6, 2019 - Varna, Bulgaria

Submission deadline: 30 June 2019

Link: https://spraakbanken.gu.se/lsi/sharedtask/

Description
-----------

The present Workshop/Shared Task seeks to transform a large set of digitized publications describing the grammars of the languages of the world into structured databases that will enable comparison of different languages at an unprecedented breadth and depth.

There are some 6 500 languages in the world, and information about their grammatical characteristics is available in book form for over 4 000 of them. Until recently, information has been extracted from grammars exclusively through manual collection. This procedure is naturally bounded by the limits of human capacity, and as such can only target a relatively small number of languages and characteristics in a given time, at a substantial time investment.

We are now entering a phase where it is practical to use NLP tools for a number of similar tasks. At a minimum, a computer may infer some characteristics of the language described simply by counting words used in a grammatical description; e.g., a high frequency of the term 'suffix' likely indicates that the language being described uses a lot of suffixes. Further, there are less straightforward or more detailed characteristics traditionally of interest to linguists, such as where the verb is placed in the sentence (beginning, middle, end), the existence and use of participles, possessive constructions, evidentiality and so on. Any techniques from the NLP toolbox, such as tf-idf weighting, tagging, parsing and vector spaces, may be used in combination and as input to more sophisticated machine learning approaches.
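The word-counting idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `term_counts` and the sample sentence are made up for this example, and matching by word prefix (so that 'suffixes' and 'suffixation' count towards 'suffix') is one possible design choice, not something prescribed by the task.

```python
import re
from collections import Counter


def term_counts(text, terms):
    """Count tokens in `text` that start with each term (case-insensitive).

    Prefix matching is an assumption of this sketch: it lets 'suffixes'
    and 'suffixation' contribute to the count for 'suffix'.
    """
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        term: sum(n for tok, n in tokens.items() if tok.startswith(term))
        for term in terms
    }


# Made-up sample sentence standing in for the raw text of a grammar.
sample = (
    "The plural suffix -ta attaches to nouns. "
    "Verbal suffixes mark tense. A single prefix marks negation."
)
counts = term_counts(sample, ["suffix", "prefix"])
print(counts)
```

In practice such raw counts would probably need normalizing by document length before grammars of different sizes are compared.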

In this shared task we provide a subset of the World Atlas of Language Structures (WALS, http://wals.info) along with the digitized sources from which the features were drawn. Sources are provided in raw text form. The task is to infer WALS datapoints from the raw text data of the digitized grammatical descriptions.

Training Data
-------------

10 000 datapoints spanning 191 languages and 100 features, along with their values and source(s), are given as training data in the following form:

Language  ISO 639-3  Feature                                         Value                Source
-----------------------------------------------------------------------------------------------------------------
Macushi   mbc        31A Sex-based and Non-sex-based Gender Systems  Non-sex-based        Abbott-1991[105-106]
Macushi   mbc        57A Position of Pronominal Possessive Affixes   Possessive prefixes  Abbott-1991[85,101]; Williams-1932[61]; Carson-1982[104-106]
E. Oromo  hae        118A Predicative Adjectives                     Mixed                Owens-1985
E. Oromo  hae        9A The Velar Nasal                              No velar nasal       Owens-1985[10]
...       ...        ...                                             ...                  ...

Features and values are defined as per WALS (http://wals.info). Sources are semicolon-separated and optionally indicate a page range in square brackets. Each source maps uniquely to an entry with bibliographical details in a BibTeX file and to a full text of the source in question. The full text is an OCR of a scan of the original source (of varying quality) and contains no formatting. OCR errors are present, especially for IPA or non-ASCII-script text in a vernacular. A total of 443 source texts are supplied.
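A Source field of the kind described above could be split into its parts roughly as follows. This is a minimal sketch: the function name `parse_sources` and the regular expression are assumptions made for illustration, and the layout of the actual training file is not specified here, so only the Source string itself is handled.

```python
import re

# Source key, optionally followed by a bracketed page specification.
# The pattern is an assumption based on the examples in the announcement.
SOURCE_RE = re.compile(r"^(?P<key>[^\[\]]+?)\s*(?:\[(?P<pages>[^\]]*)\])?$")


def parse_sources(field):
    """Split a semicolon-separated Source field into (key, pages) pairs.

    `pages` is the raw text inside the square brackets, or None if the
    source carries no page indication.
    """
    sources = []
    for part in field.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        match = SOURCE_RE.match(part)
        if match is None:
            raise ValueError(f"unrecognized source entry: {part!r}")
        sources.append((match.group("key"), match.group("pages")))
    return sources


print(parse_sources("Abbott-1991[85,101]; Williams-1932[61]; Carson-1982[104-106]"))
print(parse_sources("Owens-1985"))
```

The page text is left as a string because the examples mix comma-separated pages and hyphenated ranges; expanding it into page numbers is left open.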

The training data can be downloaded at http://stp.lingfil.uu.se/~harald/grammar-data-mining.zip

Task
----

The task is to provide the Value for an unseen Language-Feature-Source triple.

No language-specific data source external to the training data (such as the classification of a language, other sources for a language, etc.) may be used. However, other open generic linguistic data sources may be utilized (such as the raw text of the corresponding WALS chapter, a list of linguistic terms, etc.).

Not every possible value for every feature is attested in the training data set, but systems should nevertheless be able, in principle, to output any of the possible values for a feature as defined in WALS. It is not obligatory that the training set values are utilized at all.

Submission Instructions
-----------------------

Authors should submit a paper of up to 8 pages conforming to the RANLP style guidelines (see http://lml.bas.bg/ranlp2019/submissions.php) describing their technical solution to the specific task. The submission should contain a link to a runnable version (e.g. on github.com) of the authors' solution. This runnable should output a Value (and nothing else) upon running the system: e.g., given a language code, the feature of interest, and the source document(s), the system should output the feature value, as exemplified below:


>>>python grammar-data-mining.py "hae" "118A Predicative Adjectives" "Owens-1985; Heine 1981"
Mixed

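A command-line wrapper matching the invocation above might be organized along the following lines. This is a skeleton under stated assumptions, not a reference implementation: `predict_value` is a hypothetical stub that a participant would replace with a real model, and the names `run`, `iso_code`, `feature` and `sources` are illustrative choices.

```python
import argparse
import sys


def predict_value(iso_code, feature_id, sources):
    """Hypothetical stub: replace with a model that reads the source texts."""
    raise NotImplementedError("plug in a real predictor here")


def run(argv, predictor=predict_value):
    """Parse the three positional arguments and return the predicted Value."""
    parser = argparse.ArgumentParser(description="Predict a WALS feature value.")
    parser.add_argument("iso_code", help='ISO 639-3 code, e.g. "hae"')
    parser.add_argument("feature", help='e.g. "118A Predicative Adjectives"')
    parser.add_argument("sources", help="semicolon-separated source keys")
    args = parser.parse_args(argv)

    # The feature ID is taken to be the first whitespace-separated token.
    feature_id = args.feature.split(maxsplit=1)[0]
    sources = [s.strip() for s in args.sources.split(";") if s.strip()]
    return predictor(args.iso_code, feature_id, sources)


# Only act when arguments are given, so the file can also be imported.
if __name__ == "__main__" and len(sys.argv) > 1:
    # Print the Value and nothing else, as the instructions require.
    print(run(sys.argv[1:]))
```

Keeping the predictor as a parameter of `run` lets the argument handling be exercised separately from whatever model is eventually plugged in.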
Submission is electronic, using the Softconf submission system for the Grammar Data Mining Workshop at https://www.softconf.com/ranlp2019/GDM/

Papers must be written in English.

Submitted papers will be peer-reviewed by three experts from a related field.

At least one author of each accepted paper is required to register for the RANLP 2019 conference, attend the workshop, and present the paper.

Important Dates
---------------

Workshop paper submission deadline: 30 June 2019
Workshop paper acceptance notification: 28 July 2019
Workshop paper camera-ready version: 20 August 2019
Workshop: 5-6 September 2019

Evaluation
----------

Each submission will be evaluated against a test set of 1000 random datapoints drawn from the same origin as the training data set. The test set will not be made available until after submission. Aspects other than accuracy (such as running time) will not be evaluated.
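For local development, accuracy on held-out training datapoints could be computed as below. This sketch assumes exact string match between predicted and gold values after trimming whitespace; the organizers' actual scoring procedure is not described in this announcement, and the function name `accuracy` is illustrative.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted value equals the gold value.

    Exact match after stripping surrounding whitespace is an assumption
    of this sketch.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p.strip() == g.strip() for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy check using WALS value names that appear in the training excerpt.
score = accuracy(
    ["Mixed", "No velar nasal", "Possessive suffixes"],
    ["Mixed", "No velar nasal", "Possessive prefixes"],
)
print(score)
```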

Programme Committee
-------------------

Guillaume Segerer (CNRS, LLACAN, France)
Harald Hammarström (Department of Linguistics and Philology, Uppsala University, Sweden)
Markus Forsberg (Språkbanken, University of Gothenburg, Sweden)
Søren Wichmann (Leiden University Centre for Linguistics, Netherlands)
Shafqat Mumtaz Virk (Språkbanken, University of Gothenburg, Sweden)
Zeljko Agic (IT University of Copenhagen, Denmark)
Erich Round (University of Queensland, Australia)
Sebastian Nordhoff (LangSci Press, Germany)

Venue
-----

The workshop will be co-located with RANLP http://lml.bas.bg/ranlp2019 in Bulgaria and take place in Hotel "Cherno More", Varna, the main RANLP-2019 conference venue.


