[Corpora-List] GUM Corpus V8 - new data, discourse relations and more

Amir Zeldes Amir.Zeldes at georgetown.edu
Mon Jan 31 23:34:47 CET 2022

(Apologies for cross-postings)

*** The GUM Corpus - Release 8.0.0 ***

*** Georgetown University Multilayer corpus ***

Corpling at GU <https://corpling.uis.georgetown.edu/corpling/> is happy to announce the first release of series 8 of the Georgetown University Multilayer corpus (GUM V8.0.0):


New in this version:

- 25 documents added, including more conversational data (total tokens: 180,849)

- New RST discourse relations, now covering 32 labels in a two-level hierarchy, available as both discourse constituent trees and dependency trees

- More fine-grained, 6-way information status annotations for all entity mentions

- Now distinguishing 7 types of coreference relations, incl. new discourse deixis and non-identity predication in addition to older types (apposition, cataphora, etc.) and explicit annotation of singletons

- More consistent UD syntax, including a new obl:agent relation for passive agents

- New Wikidata identifiers for wikification layer (including nested and pronominal mentions)

- More comprehensive CoNLL-U format, which now includes TEI XML structure, information status, coref types and more

- Many corrections to all annotation layers
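
As a rough illustration of how the new obl:agent relation could be queried in the CoNLL-U files, here is a minimal Python sketch. It relies only on the standard ten tab-separated CoNLL-U columns; the sample sentence is hand-written for this example and is not taken from GUM, and the helper names (parse_conllu, passive_agents) are hypothetical. The GUM-specific keys in the MISC column are not decoded here.

```python
# Illustrative only: a hand-written CoNLL-U fragment, NOT taken from GUM.
SAMPLE = """\
# text = The corpus was annotated by students.
1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_
2\tcorpus\tcorpus\tNOUN\tNN\t_\t4\tnsubj:pass\t_\t_
3\twas\tbe\tAUX\tVBD\t_\t4\taux:pass\t_\t_
4\tannotated\tannotate\tVERB\tVBN\t_\t0\troot\t_\t_
5\tby\tby\tADP\tIN\t_\t6\tcase\t_\t_
6\tstudents\tstudent\tNOUN\tNNS\t_\t4\tobl:agent\t_\tSpaceAfter=No
7\t.\t.\tPUNCT\t.\t_\t4\tpunct\t_\t_
"""

# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment/metadata lines are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            # Skip anything that is not a well-formed ten-column token line.
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def passive_agents(sentence):
    """Return (agent form, head form) pairs for tokens labeled obl:agent."""
    by_id = {tok["id"]: tok for tok in sentence}
    pairs = []
    for tok in sentence:
        if tok["deprel"] == "obl:agent":
            head = by_id.get(tok["head"])
            pairs.append((tok["form"], head["form"] if head else None))
    return pairs


for sent in parse_conllu(SAMPLE):
    for agent, verb in passive_agents(sent):
        print(f"{agent} <- {verb}")  # prints: students <- annotated
```

To run this over the actual release, the SAMPLE string would be replaced by the contents of a .conllu file from the download.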

GUM is an open source corpus of richly annotated English texts from multiple genres: academic, bio, conversation, fiction, interview, news, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 8, containing roughly 180K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features

- Manually corrected lemmatization

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)

- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels)

- Information status (six categories: given-active, given-inactive, accessible-inferable, accessible-common ground, accessible-aggregate, and new)

- Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging)

- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions

- Discourse parses in Rhetorical Structure Theory and discourse dependencies

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website <https://corpling.uis.georgetown.edu/gum/> .

Best wishes,

The GUM team
