Second call for participation
(Apologies for cross-posting)
The third edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying **verbal MWEs** in running texts. Verbal MWEs include, among others, verbal idioms (to let the cat out of the bag), light-verb constructions (to make a decision), verb-particle constructions (to give up), multi-verb constructions (to make do) and inherently reflexive verbs (s'évanouir 'to faint' in French). Their identification is a well-known challenge for NLP applications, due to their complex characteristics including discontinuity, overlaps, non-compositionality, heterogeneity and syntactic variability.
Editions 1.0 <http://multiword.sourceforge.net/sharedtask2017/> (2017) and 1.1 <http://multiword.sourceforge.net/sharedtask2018/> (2018) have shown that, while some systems reach high performance (F1>0.7) for identifying VMWEs that were seen in the training corpus, performance on unseen VMWEs is very low (F1<0.2). Hence for this third edition, **emphasis will be put on discovering VMWEs that were not seen in the training corpus**.
We kindly ask potential participant teams to register using the expression of interest form:
Task updates and questions will be posted on the shared task website:
and announced on our public mailing list (anyone can join):
#### Publication and workshop
Shared task participants will be invited to submit a system description paper to a special track of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), at COLING 2020, to be held on September 14, 2020, in Barcelona, Spain:
Submitted system description papers must follow the workshop submission instructions and will go through double-blind peer reviewing. Their acceptance depends on the quality of the paper rather than on the ranking in the shared task. Authors of the accepted papers will present their work as posters/demos in a dedicated session of the MWE-LEX 2020 workshop. The submission of a system description paper is not mandatory.
Due to double-blind review, participants are asked to provide a nickname (i.e. a name that does not identify authors, universities, research groups, etc.) for their systems when submitting results and system description papers.
#### Provided corpora
The PARSEME team is preparing corpora in which VMWEs were manually annotated: https://gitlab.com/parseme/corpora/wikis/home. The provided annotations follow the PARSEME 1.1 guidelines: https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.1/.
On March 18, 2020, we will release, for each language:
* a training corpus manually annotated for VMWEs;
* a development corpus to tune/optimize the systems' parameters; and
* a syntactically parsed raw corpus, not annotated for VMWEs, to support semi-supervised and unsupervised methods for VMWE discovery (for each language, the size will be between 10 million tokens and 2 billion tokens).
On April 28, 2020, we will release, for each language:
* A blind test corpus to be used as input to the systems during the evaluation phase, during which the VMWE annotations will be kept secret.
Morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies) are also provided, both for annotated and raw corpora. Depending on the language, the information comes from treebanks (mostly Universal Dependencies v2) or from automatic parsers trained on UD v2 treebanks (e.g., UDPipe).
The annotated training and development corpora will be released in the CUPT format <http://multiword.sourceforge.net/cupt-format/>. The raw corpus will be released in the CoNLL-U format <https://universaldependencies.org/format>. The blind test corpus will be released in the CUPT format, with an underspecified 11th column to be predicted. Reference annotations for the test corpus will be released after the evaluation phase.
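To illustrate the role of the 11th column, here is a minimal Python sketch that collects the VMWEs of one sentence. It assumes the usual CUPT conventions: tab-separated columns, `*` for tokens outside any VMWE, `_` for underspecified values, and semicolon-separated codes in which the first token of a VMWE carries `id:CATEGORY` and the following tokens carry only `id`. The function name `read_vmwes`, the helper `make_row` and the sample sentence are hypothetical and not part of the shared task tooling; the format page linked above remains the reference.

```python
from collections import defaultdict


def read_vmwes(cupt_sentence):
    """Collect VMWEs from one CUPT sentence.

    Returns {mwe_id: {"category": str or None, "tokens": [(token_id, lemma), ...]}}.
    """
    vmwes = defaultdict(lambda: {"category": None, "tokens": []})
    for line in cupt_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        cols = line.split("\t")
        token_id, lemma, mwe_col = cols[0], cols[2], cols[10]
        if mwe_col in ("*", "_"):
            continue  # "*": outside any VMWE; "_": underspecified (e.g. blind test data)
        for code in mwe_col.split(";"):
            mwe_id, _, category = code.partition(":")
            vmwes[mwe_id]["tokens"].append((token_id, lemma))
            if category:  # only the first token of a VMWE carries the category
                vmwes[mwe_id]["category"] = category
    return dict(vmwes)


def make_row(token_id, form, lemma, mwe):
    # Hypothetical helper: fills the seven columns not needed here with "_".
    return "\t".join([token_id, form, lemma] + ["_"] * 7 + [mwe])


# Invented sample sentence, for illustration only.
SAMPLE = "\n".join([
    "# text = She made a decision",
    make_row("1", "She", "she", "*"),
    make_row("2", "made", "make", "1:LVC.full"),
    make_row("3", "a", "a", "*"),
    make_row("4", "decision", "decision", "1"),
])

print(read_vmwes(SAMPLE))
```

On a blind test file every token would carry `_` in the 11th column, so this function would return an empty dictionary until a system fills in its predictions.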
A small trial data set is available on the shared task's release repository: https://gitlab.com/parseme/sharedtask-data/-/tree/master/1.2/trial
Corpora are being prepared for the following languages: Bulgarian (BG), Croatian (HR), German (DE), Greek (EL), Basque (EU), French (FR), Irish (GA), Hebrew (HE), Hindi (HI), Hungarian (HU), Italian (IT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Swedish (SV), Turkish (TR), Chinese (ZH).
The amount of annotated data in the training, development, test, and raw corpus depends on the language.
System results can be submitted in two tracks:
* Closed track: Systems using only the provided training and development corpora (with VMWE and morpho-syntactic annotations) and the provided raw corpora.
* Open track: Systems that may or may not use the provided training corpus, and that may use any additional resources deemed useful (MWE lexicons, symbolic grammars, wordnets, other raw corpora, word embeddings and language models trained on external data, etc.). This track notably includes purely symbolic and rule-based systems.
In both tracks, the use of the corpora from the previous PARSEME shared tasks is strictly forbidden, as material may have moved during corpus splits.
Teams submitting systems in the open track will be requested to describe and provide references to all resources used at submission time. Teams are encouraged to favor freely available resources for better reproducibility of their results.
#### Evaluation metrics
Participants will provide the output produced by their systems on the test corpus in the CUPT format, with the 11th column containing their predictions. This output will be compared with the gold standard (ground truth) using both generic and specialised precision, recall and F1 scores.
The evaluation metrics will be the same as for the 1.1 edition, as described in:
Note that for the 1.2 edition the published general ranking will emphasize 3 metrics:
* global MWE-based
* global Token-based
* unseen MWE-based
A VMWE from the test corpus is considered seen if a VMWE with the same (multi-)set of lemmas is annotated at least once in the training corpus.
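As a rough illustration of what an MWE-based score measures, the following Python sketch computes precision, recall and F1 under a simplifying assumption: a predicted VMWE counts as correct only when its full set of tokens matches a gold VMWE in the same sentence, and category labels are ignored. The function name and the toy data are hypothetical, the token-based measure is not shown, and the official evaluation script remains the reference for the actual scores.

```python
def mwe_based_scores(gold, predicted):
    """Exact-match precision, recall and F1.

    gold, predicted: sets of (sentence_id, frozenset_of_token_ids).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy data: two gold VMWEs, two predictions, one exact match.
gold = {("s1", frozenset({2, 4})), ("s2", frozenset({1, 2}))}
predicted = {("s1", frozenset({2, 4})), ("s2", frozenset({1, 3}))}

print(mwe_based_scores(gold, predicted))  # (0.5, 0.5, 0.5)
```

The prediction in sentence `s2` shares one token with the gold VMWE but earns no credit here, which is what makes an exact-match measure strict compared with a token-based one.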
#### Corpus split
For each language, the annotated sentences will be shuffled and split in a way that ensures that there is a minimum of 300 VMWEs in the test set which are unseen in the training + dev sets. This means that the natural sequence of sentences in a document will not be respected in the proposed corpus split. Note that the unseen ratio, that is, the proportion of unseen VMWEs with respect to all VMWEs in the test set, may vary across languages. In both tracks, the use of previous shared task editions' corpora is strictly forbidden, as material may have moved during corpus splits.
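The unseen ratio can be sketched in a few lines of Python. The sketch assumes that each VMWE is represented by the list of its lemmas and that two VMWEs are "the same" when their multisets of lemmas are equal, regardless of word order. The names `lemma_key` and `unseen_ratio` and the toy lemma lists are hypothetical; whether the reference set is the training corpus alone or training plus development is left to the caller.

```python
from collections import Counter


def lemma_key(lemmas):
    """Order-insensitive key that still distinguishes repeated lemmas."""
    return frozenset(Counter(lemmas).items())


def unseen_ratio(reference_vmwes, test_vmwes):
    """Proportion of test VMWEs whose lemma multiset never occurs in the reference set."""
    seen_keys = {lemma_key(lemmas) for lemmas in reference_vmwes}
    unseen = [lemmas for lemmas in test_vmwes if lemma_key(lemmas) not in seen_keys]
    return len(unseen) / len(test_vmwes) if test_vmwes else 0.0


# Invented toy data: each VMWE is a list of lemmas.
reference = [["make", "decision"], ["give", "up"]]
test = [
    ["decision", "make"],                # seen: same lemmas, different order
    ["give", "up"],                      # seen
    ["take", "place"],                   # unseen
    ["let", "cat", "out", "of", "bag"],  # unseen
]

print(unseen_ratio(reference, test))  # 0.5
```

Using a `Counter` rather than a plain set keeps multiplicity, so a VMWE that repeats a lemma is not confused with one that contains it only once.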
#### Important dates
* Feb 19, 2020: trial data and evaluation script released
* Mar 18, 2020: training and development corpus + raw corpus released
* Apr 28, 2020: blind test corpus released
* Apr 30, 2020: submission of system results
* May 06, 2020: announcement of results
* May 20, 2020: shared task system description papers due (same as regular papers)
* Jun 24, 2020: notification of acceptance
* Jul 11, 2020: camera-ready system description papers due
* Sep 14, 2020: shared task session at the MWE-LEX 2020 <http://multiword.sourceforge.net/mwelex2020> workshop at COLING 2020
#### Organizing team
Carlos Ramisch, Marie Candito, Bruno Guillaume, Agata Savary, Ashwini Vaidya, and Jakub Waszczuk
Contact: parseme-st-core at nlp.ipipan.waw.pl