[Corpora-List] GUM Corpus 6.0.0

Amir Zeldes Amir.Zeldes at georgetown.edu
Fri Mar 6 03:13:52 CET 2020

(Apologies for cross-postings)

*** The GUM Corpus - Release 6.0.0 ***

*** Georgetown University Multilayer corpus ***

The Corpling Lab at Georgetown University <http://corpling.uis.georgetown.edu/corpling/> is happy to announce the first release of series 6 of the Georgetown University Multilayer corpus (V6.0.0).

New in this version:

- 22 documents added (total tokens: 129,660)

- Discourse parses in Rhetorical Structure Theory now follow RST-DT guidelines

- 5 new discourse relations (means, manner, attribution, question and same-unit)

- Discourse dependency representation and Lisp-style formats now available

- Now using native Universal Dependencies syntax trees (not automatic conversion)

- Many manual corrections to lemmatization, POS and other consistency improvements

GUM is an open source corpus of richly annotated English texts from multiple genres: academic, bio, fiction, interview, news, travel, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 6, containing nearly 130K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features

- Manually corrected lemmatization

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions, etc., all manual)

- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags)

- Information status (given, accessible, new)

- Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging)

- Discourse parses in Rhetorical Structure Theory

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see:





Dr. Amir Zeldes

Assoc. Prof. of Computational Linguistics

Department of Linguistics

Georgetown University

1437 37th St. NW

Washington, DC 20057

