[Corpora-List] GUM corpus V3.0.0 release

Amir Zeldes Amir.Zeldes at georgetown.edu
Fri Jan 13 19:39:37 CET 2017

(Apologies for cross-postings)

*** The GUM Corpus - Release 3.0.0 ***

*** Georgetown University Multilayer corpus ***

We are pleased to announce the release of version 3.0.0 of the Georgetown University Multilayer corpus!

GUM is an open source corpus of richly annotated English web texts from four text types (news, interviews, travel and how-to guides). The corpus is collected and expanded by students as part of the Computational Linguistics curriculum at Georgetown University. The text types were selected to represent different communicative purposes, and all texts come from Creative Commons-licensed sources, so that new texts can be annotated and published each year.

The latest version of the corpus contains 76 documents / 64K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB and CLAWS) and lemmatization

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)

- Constituent and dependency syntax (manually corrected Stanford Dependencies and automatic PTB parses from gold tags)

- Information status (given, accessible, new)

- Entity and coreference annotation (including singletons, appositions, cataphora and bridging)

- Rhetorical Structure Theory

For more information and to search or download the corpus online, see:





Dr. Amir Zeldes

Asst. Prof. of Computational Linguistics

Department of Linguistics

Georgetown University

1437 37th St. NW

Washington, DC 20057


