[Corpora-List] New version of masc_tagged for NLTK

Nancy Ide ide at cs.vassar.edu
Wed May 4 17:54:34 CEST 2016

*** Please note URL change from previous announcement ***

A new version of the POS-tagged MASC (Manually Annotated Sub-Corpus) for NLTK is available for download from http://www.cs.vassar.edu/~ide/MASC/masc_tagged.tgz. This version corrects several errors in tags for punctuation.

MASC is a corpus of 500K words of contemporary American English balanced across 19 genres. Both the individual genres and the entire corpus can be accessed from NLTK. The NLTK version of MASC is similar in format to the Brown Corpus, apart from including recent language samples from genres such as blogs, tweets, and email.

masc_tagged is already included in NLTK. However, due to a bug that has not yet been fixed by the NLTK developers, it is not automatically unzipped in the nltk_data/corpora directory when the NLTK data is downloaded, and has to be unzipped manually. Users can replace the version in that directory with the downloaded version and access it as before (using "from nltk.corpus import masc_tagged").
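For anyone who prefers to script the replacement step, here is a minimal sketch rather than an official installer. The function names install_masc_tagged and preview_masc are made up for illustration. It assumes the downloaded masc_tagged.tgz unpacks to a top-level masc_tagged/ directory and that you pass in your own nltk_data/corpora path; please check both against your setup before running it, since it deletes the existing copy first.

```python
import shutil
import tarfile
from pathlib import Path


def install_masc_tagged(archive, corpora_dir):
    """Unpack the downloaded archive into corpora_dir, replacing any older copy.

    Assumption: the archive contains a top-level 'masc_tagged/' directory.
    """
    archive = Path(archive)
    corpora_dir = Path(corpora_dir)
    target = corpora_dir / "masc_tagged"

    corpora_dir.mkdir(parents=True, exist_ok=True)
    if target.exists():
        # Remove the old version so no stale files survive the update.
        shutil.rmtree(target)

    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(corpora_dir)
    return target


def preview_masc(n=10):
    """Return the first n (word, tag) pairs; needs NLTK and the unpacked corpus."""
    from nltk.corpus import masc_tagged  # imported lazily, only when called
    return masc_tagged.tagged_words()[:n]
```

A typical call might be install_masc_tagged("masc_tagged.tgz", Path.home() / "nltk_data" / "corpora"), followed by preview_masc() to confirm that the corpus loads. The home-directory location is only one of the places NLTK searches for data, so adjust the path if your nltk_data lives elsewhere.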


Nancy Ide
Professor of Computer Science

Department of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520
USA

tel: (+1 845) 437 5988
fax: (+1 845) 437 7498
email: ide at cs.vassar.edu
http://www.cs.vassar.edu/~ide

