[Corpora-List] TweetLID: Corpus for Twitter language identification released

Iñaki San Vicente Roncal i.sanvicente at elhuyar.com
Wed Oct 1 14:20:49 CEST 2014


Dear Colleagues,

We are happy to announce the release of the TweetLID corpus, built for the TweetLID Twitter language identification shared task <http://komunitatea.elhuyar.org/tweetlid>. TweetLID is a corpus of tweets annotated for language identification. It contains 35K tweets in 6 languages (English, Spanish, Portuguese, Basque, Catalan, Galician). Each tweet is annotated with the language (or languages) the tweet is written in.

The corpus is released under a Creative Commons Attribution license (CC BY) and is available for download at the following link: http://komunitatea.elhuyar.org/tweetlid/files/2014/10/TweetLID_corpusV1.zip

If you use this corpus, please cite the following paper:

- Zubiaga, A., San Vicente, I., Gamallo, P., Pichel, J. R., Alegria, I., Aranberri, N., Ezeiza, A., Fresno, V. (2014). Overview of TweetLID: Tweet language identification at SEPLN 2014. Proceedings of the TweetLID Workshop at SEPLN 2014. Girona. pp. 1-11. ISSN: 1613-0076.

You can find more information about the corpus and the shared task on the workshop website or in the proceedings (http://ceur-ws.org/Vol-1228/).

For any further questions or suggestions, do not hesitate to contact us at tweetlid at elhuyar.com.

Regards,

TweetLID organizers.

--

*Iñaki San Vicente Roncal* I+G IKERTZAILEA / R&D RESEARCHER

i.sanvicente at elhuyar.com | inaki.sanvicente at ehu.es
<http://scholar.google.es/citations?user=eb_xVO4AAAAJ&hl=en> | <https://www.researchgate.net/profile/Inaki_San_Vicente/>
tel. Elhuyar: 943363040 | ext.: 225
tel. Ixa: 943015110 | office 314

Zelai Haundi, 3. Osinalde industrialdea 20170 Usurbil

www.elhuyar.org <http://www.elhuyar.org> | ixa.si.ehu.es <http://ixa.si.ehu.es>




