[Corpora-List] New MASC data and annotations available

Nancy Ide ide at cs.vassar.edu
Thu Mar 24 21:34:17 CET 2011


Manually Annotated Sub-Corpus

http://www.anc.org/MASC

*** All downloads available at http://www.anc.org/MASC/Download.html ***

MASC1 (82K words with multiple layers of annotation) is also available from the Linguistic Data Consortium.

MASC texts
--------------
The full 500K words of MASC spoken and written texts are now available for download from the MASC website. The corpus comprises roughly 25K words from each of 20 different genres:

Genre                 No. files   No. Words   Pct corpus
Court transcript              2       30052           6%
Debate transcript             2       32325           6%
Email                        78       27642           6%
Essay                         7       25590           5%
Fiction                       5       31518           6%
Gov't documents               5       24578           5%
Journal                      10       25635           5%
Letters                      40       23325           5%
Newspaper/newswire           41       23545           5%
Non-fiction                   4       25182           5%
Spoken                       11       25783           5%
Technical                     7       25426           5%
Travel guides                 7       26708           5%
Twitter                       2       24180           5%
Blog                         21       28199           6%
ficlets                       5       26299           5%
movie script                  2       28240           6%
spam                        110       23490           5%
jokes                        16       26582           5%
TOTAL                       375      504299

***************************************************************************************************************
We invite contributions of linguistic annotations of any kind, in any format, for any portion of the data. Contributed annotations will be made available to the community both in their original format and in GrAF format, compatible with other annotations of the data.
***************************************************************************************************************

New Annotations
---------------------
We have also made available Propbank annotations of a 40K subset of MASC that has been heavily annotated by multiple groups for many different linguistic phenomena. These are currently distributed in the original Propbank format (together with the Penn Treebank annotations on which they rely). The GrAF version of the Propbank annotations will be made available this summer.

+-----------------------------------------------------------------------------------------+
| MASC IS DEVELOPED AND DISTRIBUTED BY THE AMERICAN NATIONAL CORPUS PROJECT, WHICH IS |
| COMMITTED TO PROVIDING OPEN DATA. ALL MASC DATA AND ANNOTATIONS ARE FREELY DISTRIBUTED |
| AND MAY BE USED AND REDISTRIBUTED FOR ANY PURPOSE, INCLUDING COMMERCIAL. |
+-----------------------------------------------------------------------------------------+



