We are pleased to announce the availability of the Hamburg Dependency Treebank (HDT).
The Hamburg Dependency Treebank is, to our knowledge, the largest dependency treebank available at the date of its publication. Its dependency annotations are genuine, i.e. they were not transformed from phrase structures. All in all, the HDT contains about 3.8 million manually annotated tokens.
The HDT consists of three parts:
- part A (101,999 sentences): manually annotated and checked for consistency with DECCA
- part B (104,795 sentences): manually annotated but not checked with DECCA
- part C (55,027 sentences): automatically parsed with WCDG
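As a quick sanity check on the sizes listed above, the following Python sketch sums the sentence counts of the three parts and prints each part's share. The dictionary keys and variable names are illustrative only; they do not correspond to any file names or tooling in the HDT distribution.

```python
# Sentence counts per part, copied from the list above.
# The keys "A", "B", "C" are illustrative labels, not names from the HDT release.
sentences = {
    "A": 101_999,  # manually annotated, checked with DECCA
    "B": 104_795,  # manually annotated, not checked with DECCA
    "C": 55_027,   # automatically parsed with WCDG
}

total = sum(sentences.values())
manual = sentences["A"] + sentences["B"]

for part, count in sentences.items():
    print(f"part {part}: {count:>7,} sentences ({count / total:.1%})")
print(f"total: {total:,} sentences, of which {manual:,} manually annotated")
```

By this count, the three parts together hold 261,821 sentences, of which 206,794 (parts A and B) were annotated manually.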
The HDT is free for academic use. The annotations (but not the texts) are licensed under CC-BY-SA 4.0.
The sentences were all sourced from the German news site heise.de, from articles published between 1996 and 2001. The content of the articles is varied: formulaic periodic updates on new BIOS revisions and processor models, quarterly earnings of tech companies, features about general trends in the hardware and software market, and general coverage of social, legal and political issues in cyberspace, sometimes in the form of extensive weekly editorial comments. The mapping from sentences to articles and authors is retained, which allows, for example, analysis of individual style.
You can obtain the HDT by visiting http://hdl.handle.net/11022/0000-0000-7FC7-2
The paper describing the Hamburg Dependency Treebank: http://nbn-resolving.de/urn:nbn:de:gbv:18-228-7-2013
This paper also reports parser accuracy for MaltParser, TurboParser, and the Mate parser at various training set sizes.
The annotation guidelines (in German): http://nbn-resolving.de/urn:nbn:de:gbv:18-228-7-2048
-- Arne Köhn, Research Associate
Natural Language Systems Group (AB Natürlichsprachliche Systeme)