Call for papers
General-purpose dense word embeddings have come a long way since the beginning of their boom in 2013, and they are still the most widely used way of representing words in both industrial and academic NLP systems. However, the problem of finding intrinsic metrics that are predictive of performance on downstream tasks, and that can help to develop better representations, is far from solved. At the sentence level and above, we now have a number of probing tasks and large extrinsic evaluation datasets targeting high-level verbal reasoning, but there is still much to learn about what features make a compositional representation successful. Last but not least, there are no established intrinsic methods for newer kinds of representations such as ELMo, BERT, or box embeddings.
The third edition of RepEval aims to foster discussion of the above issues, and to support the search for high-quality general-purpose representation learning techniques for NLP. We hope to encourage interdisciplinary dialogue by welcoming diverse perspectives on the above issues: submissions may focus on properties of embedding space, performance analysis for various downstream tasks, as well as approaches based on linguistic and psychological data. In particular, experts from the latter fields are encouraged to contribute analysis of claims previously made in the NLP community.
RepEval 2019 invites submissions on topics including, but not limited to, the following:
* approaches to intrinsic and extrinsic evaluation of all kinds of
representations, esp. contextualized;
* evaluation motivated by linguistic, psycholinguistic or neurological
evidence, its predictive power, and interpretability of meaning
representations vs evaluation on downstream tasks;
* the (un)stability of vector representations, best practices for
reproducible and reliable experiments;
* evaluation of representations at subword level, especially for
morphologically complex languages;
* evaluation of phrase, sentence, paragraph and document-level
representations: evidence of compositionality, further diagnostic
tests, and how much the preservation of abstract syntactic
information actually contributes to performance;
* formal analysis of properties of embedding spaces and their impact
on downstream tasks;
* the effect of representations vs other elements of pipeline in
* validation of evaluation methodology and findings in cross-lingual
* specialized vs general-purpose representations, and whether the
latter have inherent limits in downstream tasks;
* internal states of end-to-end systems as meaning representations,
and ways to make more sense of them.
Types and Requirements
The workshop will accept submissions through two tracks: research and shared task. Research paper submissions may consist of 4-6 pages of content, plus unlimited references. Shared task submissions may consist of up to 4 pages of content, plus unlimited references. An additional page in the camera-ready version will be available for addressing reviewers’ comments.
Please refer to the NAACL author guidelines for the style files and the policy on double submissions and preprints: https://naacl2019.org/calls/papers/#author-guidelines
The research track will showcase proposals for new evaluation techniques for old and new representations; the submissions are expected to experimentally demonstrate the benefits of the new approach. We also invite critical analysis and/or negative results for the existing approaches.
We welcome both theoretical analysis (especially from experts in other domains such as linguistics or psychology) and methodological caveats (reproducibility, the impact of parameters, the issue of attribution of results to the representation or the whole system, dataset structure/balance/representativeness).
Theoretical papers might like to consider the following questions:
* What are the pros and cons of existing evaluations?
* What are the limitations of task-independent representation or its
* (Mis)attribution of performance improvements to various elements of
the pipeline in complex NLP systems;
* Given a specific downstream application, which existing evaluation
(or family of evaluations) is a good predictor of performance
* Which linguistic/semantic/psychological properties are captured by
existing evaluations? Which are not?
* What methodological mistakes were made in the creation of existing
* What linguistic/psychological properties of meaning representations
are supposed to make them "better", and why?
* The recent tendency is to take high-level reasoning tasks such as QA
or inference as the "ultimate" evaluation for meaning
representations (effectively, a Turing test proxy). How justified is
this approach? Should a "good" representation excel at all such
tasks, or specialize? What alternatives do we have?
Proposal papers should introduce a novel method for evaluating representations, accompanied by a proof-of-concept dataset (of which at least a sample should be made available to the reviewers at submission time). The new method should highlight some previously unnoticed properties of the target representations, or enable a faster or more cost-effective way of measuring some previously known properties. We also invite proposals that can demonstrate a significant improvement to the previous metrics (e.g. an update to an imbalanced or noisy dataset that shows that previous claims were misattributed).
Each proposal should explicitly mention:
* Which type of representation it evaluates (e.g. word, sentence,
document, contextualized or not), and what specific properties of
that representation it targets;
* For which downstream application(s) it functions as a proxy;
* Any linguistic/semantic/psychological properties it captures, in
comparison with previous work;
* If any annotation was performed, what was the inter-annotator
agreement, and how cost-effective would it be to scale it up and/or
create a similar resource for other languages?
* If the dataset collection involved human participants, personally
identifiable information, social media or web data, please include
the relevant details for the institutional review, license terms and
permissions to use/release the data.
The RepEval 2019 shared task invites submissions of new word embedding models to be evaluated on MultiEval, a new suite of intrinsic and extrinsic tasks. The participants will need to train their models on the corpus that will be provided by the organizers on January 15th, and submit the resulting representations to the organizers by February 24th. The results will be made public on the workshop website by February 28th. Shared task papers (up to 4 pages + unlimited references) will be due by March 6th.
Shared task page: https://repeval2019.github.io/task/
Submission is electronic, using the Softconf START conference management system at https://www.softconf.com/naacl2019/repeval/
All accepted papers must be presented at the workshop to appear in the proceedings. At least one author of each accepted paper must register for the workshop by the early registration deadline. Previous presentations of the work (e.g. preprints on arXiv.org) should be indicated in a footnote that should be excluded from the review submission, but included in the final version of papers appearing in the NAACL-HLT 2019 proceedings.
First call for workshop papers: December 22, 2018
Second call for workshop papers: January 15, 2019
Shared task training data released: January 24, 2019
Shared task submissions due: February 24, 2019
Shared task results announced: before February 28, 2019
Workshop papers due: March 6, 2019
Notification of acceptance: March 27, 2019
Camera-ready papers due: April 5, 2019
Workshop date: June 6, 2019
All deadlines are 11:59 pm UTC-12.
Email: repeval2019 at googlegroups.com
Workshop website: https://repeval2019.github.io/