[Corpora-List] DECOW14: 11.7 billion token German web corpus available

Roland Schäfer roland.schaefer at fu-berlin.de
Mon Feb 23 16:23:28 CET 2015


The DECOW14AZ German web corpus is now available free of charge to people working in academia. The released corpus is a sentence shuffle containing 11.7 billion tokens after aggressive cleaning. It is derived from a non-shuffled corpus of 20.5 billion tokens.

Metadata:

- source IP and URL
- boilerplate and document quality scores
- crawl date and last-modified date
- geolocation: country and city
- score(s) that indicate quasi-spontaneous registers

Linguistic annotation:

- POS tag (TreeTagger/STTS)
- lemma (enhanced TreeTagger)
- named entity (Stanford/Pado)
- morphology (mate-tools)

Find more information about DECOW14A here:

http://corporafromtheweb.org/decow14/
http://hpsg.fu-berlin.de/cow/

You can query COW corpora (currently: Dutch, English, German, Swedish) and download them using our own Colibri² web front-end:

https://webcorpora.org/

Best regards,
Roland Schäfer


