The First Workshop on Subword and Character LEvel Models in NLP (SCLeM)
To be held at EMNLP 2017 in Copenhagen on September 7, 2017
Submission deadline: June 10, 2017 (NEW)
Notification: June 30, 2017
Camera-ready: July 14, 2017
Workshop: September 7, 2017
INVITED SPEAKERS

Kyunghyun Cho, NYU
Karen Livescu, TTIC
Tomas Mikolov, Facebook
Noah Smith, Univ of Washington
INVITED TUTORIAL TALK
Neural weighted finite-state machines, Ryan Cotterell, JHU
Traditional NLP starts with a hand-engineered layer of representation, the level of tokens or words. A tokenization component first breaks up the text into units using manually designed rules. Tokens are then processed by components such as word segmentation, morphological analysis and multiword recognition. The heterogeneity of these components makes it hard to create integrated models of both structure within tokens (e.g., morphology) and structure across multiple tokens (e.g., multi-word expressions). This approach can perform poorly (i) for morphologically rich languages, (ii) for noisy text, (iii) for languages in which the recognition of words is difficult and (iv) for adaptation to new domains; and (v) it can impede the optimization of preprocessing in end-to-end learning.
The workshop provides a forum for discussing recent advances as well as future directions on sub-word and character-level natural language processing and representation learning that address these problems.
09:00-09:10 Welcome
09:10-09:50 Invited talk 1 (Tomas Mikolov)
09:50-10:30 Invited talk 2 (Noah Smith)
10:30-11:00 Coffee break
11:00-11:40 Invited tutorial talk (Ryan Cotterell)
11:40-12:10 Best paper presentations
12:10-14:00 Poster session & Lunch break
14:00-14:40 Invited talk 3 (Kyunghyun Cho)
14:40-15:40 Poster session & Coffee break
15:40-16:20 Invited talk 4 (Karen Livescu)
16:20-17:30 Panel discussion
17:30-17:45 Closing remarks
TOPICS OF INTEREST
- tokenization-free models
- character-level machine translation
- character-ngram information retrieval
- transfer learning for character-level models
- models of within-token and cross-token structure
- NL generation (of words not seen in training etc.)
- out-of-vocabulary words
- morphology & segmentation
- relationship between morphology & character-level models
- stemming and lemmatization
- inflection generation
- orthographic productivity
- form-meaning representations
- true end-to-end learning
- spelling correction
- efficient and scalable character-level models
SUBMISSIONS OF LONG AND SHORT PAPERS AND EXTENDED ABSTRACTS
Please submit your paper using START: https://www.softconf.com/emnlp2017/sclem/
Submissions must be in PDF format, anonymized for review, written in English and follow the EMNLP 2017 formatting requirements (available at http://emnlp2017.net/).
We strongly advise that you use the LaTeX template files provided by EMNLP 2017.
Long paper submissions consist of up to eight pages of content. Short paper submissions consist of up to four pages of content. There is no limit on the number of pages for references. There is no extra space for appendices. Accepted papers will be given one additional page for content.
Authors can also submit extended abstracts of up to eight pages of content. Add "(EXTENDED ABSTRACT)" to the title of an extended abstract submission. Extended abstracts will be presented as talks or posters if selected by the program committee, but will not be included in the proceedings. Thus, your work will retain the status of being unpublished, and later submission to another venue (e.g., a journal) is not precluded.
ORGANIZERS

Manaal Faruqui, Google
Hinrich Schuetze, LMU Munich
Isabel Trancoso, INESC-ID/IST
Yadollah Yaghoobzadeh, LMU Munich