Please note that the deadline for regular papers has been extended until April 8 and thereby aligned with the shared task submission deadline.

ACL Workshop on

Distributional Semantics and Compositionality (DiSCo'2011)


June 24, 2011, Portland, Oregon, USA

** NEW Regular paper submission deadline: ** April 8, 2011

Test data submission and system description deadline: April 8, 2011

Notification of acceptance: April 25, 2011

Camera-ready deadline: May 6, 2011


* We are pleased to announce Dominic Widdows as the invited speaker at DiSCo'2011.

** Workshop Description **

Any NLP system that does semantic processing relies on the assumption of semantic compositionality: the meaning of a phrase is determined by the meanings of its parts and their combination. However, this assumption does not hold for lexicalized phrases such as idiomatic expressions, which cause problems not only for semantic but also for syntactic processing (see Sag et al. 2001).

In particular, while distributional methods in semantics have proved very effective in tackling a wide range of natural language processing tasks, e.g., document retrieval, clustering and classification, question answering, query expansion, word similarity, synonym extraction, relation extraction, and textual advertisement matching in search engines (see Turney and Pantel 2010 for a detailed overview), they are still strongly limited by being inherently word-based. Dictionaries and other lexical resources do contain multiword entries, but these are expensive to obtain and not available to a sufficient extent for all languages. Moreover, the definition of a multiword varies across resources, and non-compositional phrases are merely a subclass of multiwords.

The workshop brings together researchers who are interested in extracting non-compositional phrases from large corpora by applying distributional models that assign a graded compositionality score to a phrase, as well as researchers interested in expressing compositional meaning with such models. The score denotes the extent to which the compositionality assumption holds for a given expression and can be used, for example, to decide whether the phrase should be treated as a single unit in applications. We emphasize that the focus is on automatically acquiring semantic compositionality. Approaches that employ prefabricated lists of non-compositional phrases should consider a different venue.
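As a rough illustration of what a graded compositionality score can look like, the sketch below compares the context vector observed for a whole phrase with the sum of the context vectors of its parts, using cosine similarity. This is only one simple option among many and is not a method prescribed by the workshop. The function names and the toy co-occurrence counts are invented for this example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def compositionality_score(phrase_vec, part_vecs):
    """Compare the phrase's own context vector with the sum of its parts' vectors.

    Additive composition is an assumption of this sketch, not a requirement.
    """
    composed = Counter()
    for vec in part_vecs:
        composed.update(vec)  # adds the counts of each part
    return cosine(phrase_vec, composed)


# Toy co-occurrence counts, invented for illustration only.
red = Counter({"colour": 4, "paint": 3, "bright": 2})
car = Counter({"drive": 5, "road": 3, "paint": 1})
red_car = Counter({"drive": 3, "paint": 2, "colour": 1})

hot = Counter({"temperature": 5, "sun": 3})
dog = Counter({"bark": 4, "pet": 3})
hot_dog = Counter({"mustard": 4, "eat": 3, "bun": 2})

literal = compositionality_score(red_car, [red, car])
idiomatic = compositionality_score(hot_dog, [hot, dog])
```

In this toy setting the phrase whose contexts overlap with those of its parts receives the higher score, while the phrase with entirely different contexts scores zero. Real systems would build the vectors from a large corpus and may use other composition functions.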

This event consists of a main session and a shared task.

References:

Sag, I. A., T. Baldwin, F. Bond, A. Copestake and D. Flickinger. (2001). Multi-word Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), Mexico City, Mexico.

Turney, P. and P. Pantel. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141-188.

** Call for Papers **

For the main session, we invite submission of papers on the topic of automatically acquiring a model for semantic compositionality. This includes, but is not limited to:

• Models of Distributional Similarity

• Graph-based models over word spaces

• Vector-space models for distributional semantics

• Applications of semantic compositionality

• Evaluation of semantic compositionality

Authors are invited to submit papers on original, unpublished work in the topic area of this workshop. In addition to long papers presenting completed work, we also invite short papers and demos:

- Long papers should present completed work and should not exceed 8 pages plus 1 page of references.

- Short papers/demos can present work in progress or the description of a system, and should not exceed 4 pages plus 1 page of references.

As reviewing will be blind, please ensure that papers are anonymous. Papers should not include the authors' names and affiliations, or any references to websites, project names, etc. that reveal the authors' identity.

** Shared Task **

The organizers extracted candidate phrases from two large-scale, freely available web corpora, ukWaC and deWaC (cf. http://wacky.sslmit.unibo.it/), containing POS-tagged English and German text, respectively. These data have been manually evaluated for compositionality using Amazon Mechanical Turk. Workers were presented with a sentence containing a bolded target phrase and were asked to score how literal the phrase was on a scale from 0 to 10. For each phrase, 4-5 different, randomly sampled sentences from the WaCky corpora for UK English and German were each presented to 4 workers.

Phrases consist of two lemmas and come in three grammatical relations:

• ADJ_NN: adjective modifying a noun

• V_SUBJ: noun as a subject of a verb

• V_OBJ: noun as an object of a verb

Phrases were extracted semi-automatically. The relations were assigned by patterns and manually checked for validity. Phrases were selected so as to balance the data set while controlling for frequency. The complete data set was split into 40% training, 10% validation and 50% test.
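For participants who want to reproduce a comparable partition on their own development data, a 40/10/50 split can be sketched as follows. This is a minimal illustration that assumes a simple random shuffle. The procedure the organizers actually used is not described here, and the function name, seed and placeholder phrases are hypothetical.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle items and split them 40% / 10% / 50% (train / validation / test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n_train = len(shuffled) * 40 // 100
    n_val = len(shuffled) * 10 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, roughly 50%
    return train, validation, test


# Placeholder phrase identifiers, invented for illustration only.
phrases = [f"phrase_{i}" for i in range(100)]
train, validation, test = split_dataset(phrases)
```

Integer arithmetic is used for the split sizes so that rounding never drops an item. Any remainder goes to the test portion.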

More details on the data set, as well as the download link for the training and validation data, are available from the workshop's website (http://disco2011.fzi.de/).

Participants in the task are free to choose the methods and data resources they use in their submission. Prefabricated lists of multiwords are not allowed. Since the data set is derived from the WaCky corpora, participants are strongly encouraged to use these freely available text collections to build their models of compositionality, thus ensuring the highest possible comparability of results. Furthermore, since the WaCky corpora are provided already POS-tagged and lemmatized, the workload on the participants' side is considerably reduced. Participants may choose whether or not to use this information (POS tags and lemmatization). If needed, additional linguistic annotations or processing may also be added to the corpora. To minimize load on the WaCky organizers, please email us (< disco2011workshop @ gmail.com >) for instructions on obtaining the WaCky corpora. Of course, you can also contact the WaCky community directly at http://wacky.sslmit.unibo.it/doku.php?id=start.

Participants also submit a 4-page system description for publication in the workshop volume.

** Program Committee **

• Enrique Alfonseca, Google Research, Switzerland

• Tim Baldwin, University of Melbourne, Australia

• Marco Baroni, University of Trento, Italy

• Paul Buitelaar, National University of Ireland, Ireland

• Chris Brockett, Microsoft Research, Redmond, US

• Tim van de Cruys, INRIA, France

• Stefan Evert, University of Osnabrück, Germany

• Antske Fokkens, Saarland University, Germany

• Silvana Hartmann, TU Darmstadt, Germany

• Alfio Massimiliano Gliozzo, IBM, Hawthorne, NY, USA

• Mirella Lapata, University of Edinburgh, UK

• Ted Pedersen, University of Minnesota, Duluth, USA

• Yves Peirsman, Stanford University, USA

• Sebastian Rudolph, Karlsruhe Institute of Technology, Germany

• Peter D. Turney, National Research Council Canada, Canada

• Magnus Sahlgren, Gavagai, Sweden

• Serge Sharoff, University of Leeds, UK

• Anders Søgaard, University of Copenhagen, Denmark

• Daniel Sonntag, German Research Center for AI, Germany

• Diana McCarthy, Lexical Computing Ltd., UK

• Dominic Widdows, Google, USA

Workshop Chairs:

• Chris Biemann, UKP lab, TU Darmstadt, Germany

• Eugenie Giesbrecht, FZI Research Center for Information Technology at the University of Karlsruhe, Germany

