[Corpora-List] GUM Corpus 4.0.0 - with new genres

Amir Zeldes Amir.Zeldes at georgetown.edu
Thu Mar 1 19:14:22 CET 2018

(Apologies for cross-postings)

*** The GUM Corpus - Release 4.0.0 ***

*** Georgetown University Multilayer corpus ***

We are pleased to announce the first release of series 4 of the Georgetown University Multilayer corpus (V4.0.0).

New: *four new genres*: academic, biographies, fiction and reddit forum discussions

GUM is an open-source corpus of richly annotated English texts from multiple genres. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 4, containing 85K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, CLAWS5 and UPOS)

- Manually corrected lemmatization

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)

- Constituent and dependency syntax (manually corrected Stanford Dependencies, with automatic conversion to Universal Dependencies, and PTB constituent parses based on gold tags)

- Information status (given, accessible, new)

- Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and bridging)

- Discourse parses in Rhetorical Structure Theory
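
For readers who want to inspect the dependency layer programmatically, here is a minimal Python sketch. It assumes the Universal Dependencies conversion is available as CoNLL-U text (ten tab-separated columns per token). That format choice, the sample sentence and the `parse_conllu` helper are illustrative assumptions, not a description of the release's actual file layout.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Skip lines that do not have exactly ten columns.
            continue
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample sentence, not taken from the corpus.
SAMPLE = "\n".join([
    "# text = Dogs bark .",
    "\t".join(["1", "Dogs", "dog", "NOUN", "NNS", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "VBP", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", ".", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for tok in sentence:
            print(tok["form"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

Each token dict exposes the lemma, the UPOS and PTB-style tags, and the dependency head and relation by name, which is usually enough for simple counts or filters across genres.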

Note on reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see:





Dr. Amir Zeldes

Asst. Prof. of Computational Linguistics

Department of Linguistics

Georgetown University

1437 37th St. NW

Washington, DC 20057

