*** The GUM Corpus - Release 4.0.0 ***
*** Georgetown University Multilayer corpus ***
We are pleased to announce the first release of series 4 of the Georgetown University Multilayer corpus (V4.0.0).
New: *four new genres*: academic, biographies, fiction, reddit forum discussions
GUM is an open-source corpus of richly annotated English texts from multiple genres. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.
Version 4.0.0 contains 85K tokens annotated for:
- Multiple POS tags (100% manual gold PTB, extended PTB, CLAWS5 and UPOS)
- Manually corrected lemmatization
- Sentence segmentation and rough speech act (manual)
- Document structure using TEI tags (paragraphs, headings, figures, captions, etc., all manual)
- Constituent and dependency syntax (manually corrected Stanford Dependencies, automatic conversion to Universal Dependencies and PTB parses from gold tags)
- Information status (given, accessible, new)
- Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and bridging)
- Discourse parses in Rhetorical Structure Theory
Note on reddit data: token text is not contained in the release but can be downloaded with an included script.
For more information and to search or download the corpus online, see:
Dr. Amir Zeldes
Asst. Prof. of Computational Linguistics
Department of Linguistics
1437 37th St. NW
Washington, DC 20057