[Corpora-List] Final Call for Papers: 4th Web as Corpus Workshop (LREC 2008, Marrakech)

Stefan Evert stefan.evert at uos.de
Fri Feb 22 17:45:34 CET 2008

===== Second Call for Papers =====

The 4th Web as Corpus workshop: Can we beat Google?

Marrakech, Morocco (post-LREC workshop) 1 June 2008



Submission deadline: 29 February 2008

PAPER SUBMISSION: http://www.easychair.org/conferences/?conf=wac4



Commercial Web search engines offer fast search on huge amounts of text, combined with increasingly clever ranking and data analysis algorithms, but their content-centric services do not cater to the needs of the computational linguistics and NLP communities. The leading theme of this workshop, the fourth in a series of highly successful Web as Corpus meetings, is to find out how to combine the power and scalability of modern search engine technology with sophisticated linguistic annotation and query processing.

We invite papers on various topics concerning the use of Web resources for corpus research and NLP applications, including (but not limited to) the following:

* linguistic Web crawler technology and Web corpus collection projects

* applications of Web-derived corpora and other kinds of Web data

* how far does the "easy way" get you? (using search engines, or Google's n-gram lists; we are particularly interested in a critical discussion of the usefulness and limitations of such approaches)

* methods and tools for "cleaning" Web pages to turn them into a corpus (contributors to this topic will be encouraged to participate in the second CLEANEVAL competition to be held in 2009)

* automatic linguistic annotation of Web data: tokenisation, POS tagging, lemmatisation, semantic tagging, etc. (established tools often perform very poorly on Web data)

* search engine architectures for linguists: bringing linguistics to commercial search engines, or high-performance search technology to linguistics?

* search engine-related topics such as result ranking (e.g. how to identify "typical" uses rather than returning 50 very similar matches on the first page)

* duplicate detection, interactive query refinement, etc.

* reviews and clever uses of search engine APIs (Google, Yahoo, Altavista, and in particular Microsoft's current generous LiveSearch API)

This workshop is endorsed by the Special Interest Group on the Web as Corpus (SIGWAC) of the Association for Computational Linguistics (ACL).


Authors are invited to submit full papers on original, unpublished work in the topic area of this workshop. Submissions should follow the format of LREC proceedings and should not exceed eight (8) pages, including references. We strongly recommend the use of LREC LaTeX or Microsoft Word style files tailored for this year's conference.

Submissions are managed via EasyChair.org. In order to submit a paper, go to:

http://www.easychair.org/conferences/?conf=wac4

and log in (or register an account with EasyChair if you don't have one yet). After logging in, click 'New Submission' and fill in the standard fields.


Silvia Bernardini, U of Bologna, Italy
Massimiliano Ciaramita, Yahoo! Research Barcelona, Spain
Jesse de Does, INL, Netherlands
Katrien Depuydt, INL, Netherlands
Stefan Evert, U of Osnabrück, Germany
Cédrick Fairon, UCLouvain, Belgium
William Fletcher, U.S. Naval Academy, USA
Gregory Grefenstette, Commissariat à l'Énergie Atomique, France
Péter Halácsy, Budapest U of Technology and Economics, Hungary
Katja Hofmann, U of Amsterdam, Netherlands
Adam Kilgarriff, Lexical Computing Ltd, UK
Igor Leturia, Elhuyar Fundazioa, Basque Country, Spain
Phil Resnik, U of Maryland, College Park, USA
Kevin Scannell, Saint Louis U, USA
Gilles-Maurice de Schryver, U Gent, Belgium
Klaus Schulz, LMU München, Germany
Serge Sharoff, U of Leeds, UK
Eros Zanchetta, U of Bologna, Italy


Stefan Evert, University of Osnabrück
Adam Kilgarriff, Lexical Computing
Serge Sharoff, University of Leeds
