[Corpora-List] v3.0 release of the CRAFT Corpus

Bada, Mike Mike.Bada at ucdenver.edu
Thu May 24 01:47:27 CEST 2018

Hi all,

The Hunter Lab at the University of Colorado School of Medicine is pleased to announce the v3.0 release of the Colorado Richly Annotated Full Text (CRAFT) Corpus, a collection of 67 full-length, open-access biomedical journal articles and gold-standard annotations of them along multiple axes, specifically:

* sentence segmentation

* tokenization

* part-of-speech tagging

* dependency structures

* treebanking

* markup of coreferential noun phrases

* markup of document sections and typography

* annotations of concepts represented in ten Open Biomedical Ontologies (the Chemical Entities of Biological Interest ontology, Cell Ontology, Gene Ontology Biological Process, Gene Ontology Cellular Component, Gene Ontology Molecular Function, Molecular Process Ontology, NCBI Taxonomy, Protein Ontology, Sequence Ontology, and the Uberon anatomical ontology)

Also included are the versions of the ontologies used for the concept annotations and various text files useful for comparing automatically generated concept annotations to this gold standard. There are many changes compared to v2.0, including:

* The concept annotations for all eight of the ontologies used in previous versions of the corpus have been updated using the classes of newer versions of these ontologies, resulting in substantial increases in the annotation counts for some of the ontology passes. Additionally, extension classes of the ontologies have been created and extensively used for annotation. The concept annotations for each ontology are packaged into sets created without any extension classes and sets augmented with annotations using extension classes, along with various class mapping files. This is discussed at length in ontology-concepts/README.md within the distribution.

* Concept annotations using the Molecular Process Ontology and the Uberon anatomical ontology have been created for the articles of the corpus. Extension classes and annotations have also been created for these.

* The Gene Ontology Biological Process and Gene Ontology Molecular Function concept annotations have been modularized into separate annotation sets, one for each.

* The structure of the NCBITaxon annotations has been changed: each annotation is now specified directly with the appropriate NCBITaxon ID rather than carrying the ID as an attribute. This new structure is consistent with the concept annotations created with the other ontologies.

* The Entrez Gene annotations have been removed from the distribution, as we believe they do not match the quality of the other concept annotations, and we recommend that they not be used.

* The concept annotation span guidelines have been slightly modified; this is discussed in ontology-concepts/README.md within the distribution.

* The concept annotations are no longer distributed in the AO RDF or GENIA XML formats, nor as Protégé-Frames projects. However, they are now also provided in the Knowtator 2 format (which we believe is a more intuitive format than the original Knowtator format) as well as the brat format. Additionally, we have removed the previously included XML files whose offsets are based on Unicode code points, as our analysis found them to be identical to those based on Java code points and therefore redundant.

* The coreference annotations are provided in Knowtator, Knowtator 2, and UIMA XMI formats in addition to the previously available brat format.

* The directory structure of the distribution has been substantially changed. Along with the journal articles, the top level is organized by annotation type (coreference, dependency structures, ontology concepts, sentences/tokens/parts of speech, sections/typography, and treebanking). All available formats for a given annotation type are organized under it.
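
For readers who want to compare automatically generated concept annotations against the brat-format gold standard, the following is a minimal Python sketch, not part of the distribution. It assumes standard brat standoff lines (ID, a tab, the label and character offsets, a tab, the covered text) and scores by exact match of label and offsets, which is only one possible criterion and not necessarily the one the distribution's comparison files are meant for. The names parse_brat and exact_match_scores are hypothetical, and the sample IDs, offsets, and label style are invented for illustration and may not match the files in the distribution.

```python
from typing import NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    label: str                             # concept label, e.g. an ontology class ID
    fragments: Tuple[Tuple[int, int], ...]  # one (start, end) pair per span fragment


def parse_brat(ann_text: str) -> Set[Annotation]:
    """Collect text-bound ("T") annotations from brat standoff text."""
    annotations = set()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, and blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed line: no label/offset field
        label, _, offsets = fields[1].partition(" ")
        # Discontinuous annotations list their fragments separated by ';'.
        fragments = tuple(
            (int(start), int(end))
            for start, end in (frag.split() for frag in offsets.split(";"))
        )
        annotations.add(Annotation(label, fragments))
    return annotations


def exact_match_scores(gold: Set[Annotation], predicted: Set[Annotation]):
    """Return (precision, recall, F1) counting only exact label+offset matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented sample data (assumption: not taken from the corpus).
GOLD = (
    "T1\tNCBITaxon:10088 0 4\tmice\n"
    "T2\tCL:0000540 20 27\tneurons\n"
    "T3\tUBERON:0000955 35 40;45 50\tbrain tissue\n"
)
PREDICTED = (
    "T1\tNCBITaxon:10088 0 4\tmice\n"
    "T2\tCL:0000540 20 26\tneuron\n"  # span boundary differs, so no exact match
)

precision, recall, f1 = exact_match_scores(parse_brat(GOLD), parse_brat(PREDICTED))
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching penalizes any boundary difference, as with the second predicted annotation above; a more lenient overlap-based criterion would need a different comparison than set intersection.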

The v3.0 distribution is available at the new CRAFT GitHub site at:


There’s a top-level README, and we strongly encourage users who are interested in using the concept annotations to also go through the README in the ontology-concepts directory. Previous versions of the corpus are still available at:


The CRAFT annotations are free to use under the terms of the CC BY 3.0 license. We hope the new release is helpful to the community. Please let us know if you have any questions.

Cheers,
Mike Bada, on behalf of the Hunter Lab