[Corpora-List] GUM corpus 3.2.0 with Universal Dependencies

Amir Zeldes Amir.Zeldes at georgetown.edu
Sat Feb 3 19:44:23 CET 2018

(Apologies for cross-postings)

*** The GUM Corpus - Release 3.2.0 ***

*** Georgetown University Multilayer corpus ***

We are pleased to announce the final release of series 3 of the Georgetown University Multilayer corpus (V3.2.0). We look forward to releasing new data in GUM series 4 soon!

New: *Universal Dependencies* version

GUM is an open-source corpus of richly annotated English web texts from multiple genres. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the final version of GUM series 3, containing 64K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, CLAWS5 and Universal POS), and corrected lemmatization

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)

- Constituent and dependency syntax (manually corrected Stanford Dependencies, automatic conversion to Universal Dependencies, as well as automatic PTB parses from gold tags)

- Information status (given, accessible, new)

- Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and bridging)

- Discourse parses according to Rhetorical Structure Theory

For more information and to search or download the corpus online, see:





Dr. Amir Zeldes

Asst. Prof. of Computational Linguistics

Department of Linguistics

Georgetown University

1437 37th St. NW

Washington, DC 20057


