[Corpora-List] Broad Twitter Corpus v1.0: release

Leon Derczynski leonderczynski at gmail.com
Thu Feb 1 20:33:54 CET 2018


We are proud to announce the release of the Broad Twitter Corpus.

== About

The Broad Twitter Corpus is a dataset of tweets (147K tokens) collected over a range of times, places and social uses. The goal is to represent a broad range of activities, giving a dataset more representative of the language used in this hardest of social media formats to process.

Further, the BTC is annotated for named entities. The entities and the crowd annotations are all provided with the corpus, as well as (where possible) the raw twitter JSON.

You can find the full story behind the corpus at http://www.aclweb.org/anthology/C16-1111

== Use

The BTC is released as CC-BY 4.0. If you use this data, you should cite the accompanying paper:

Broad Twitter Corpus: A Diverse Named Entity Recognition Resource. Leon Derczynski, Kalina Bontcheva, and Ian Roberts. Proceedings of COLING, pages 1169-1179, 2016.

== Access

Data and documentation can be found at https://github.com/GateNLP/broad_twitter_corpus . This corpus is also a @github repo, so please send pull requests and create issues there. We anticipate tagging milestones of the dataset, to retain reproducibility. The repository also contains the paper describing the corpus.

== Slides

The slideshow describing and motivating the corpus is here:

https://www.slideshare.net/leonderczynski/broad-twitter-corpus-a-diverse-named-entity-recognition-resource

Enjoy!

Leon, Kalina and Ian @ Sheffield


