[Corpora-List] Free ngrams (COW14): English, German, Spanish, Swedish

Roland Schäfer roland.schaefer at fu-berlin.de
Sun Jun 7 20:31:43 CEST 2015


--------------------------------------------------------------------

Please re-distribute this wherever you think it's appropriate.

--------------------------------------------------------------------

We are pleased to announce the release of the first very large ngram databases derived from the giga-token COW14 web corpora. They are completely free (CC-BY) and can be downloaded without registration. We have applied no frequency thresholds whatsoever. In addition to the counted ngram lists, we offer raw versions so that everybody can create their own counted version. The raw ngrams also contain additional information (crawl year, top-level domain, country geolocation).

There are also English dependency bigrams (based on Malt parses) containing words, their heads, and the dependency relation between them.

For end-users, there are also word and lemma frequency lists with some convenient frequency measures, optionally with a frequency threshold of 10 (smaller files, easier handling).

--------------------------------------------------------------------

LICENSE AND REFERENCES

License Creative Commons Attribution 4.0 International

References http://corporafromtheweb.org/category/cow-citation/

Please tell us whenever you publish work based on COW:

https://webcorpora.org/publication/

DOWNLOAD

http://hpsg.fu-berlin.de/cow/ngrams/

http://hpsg.fu-berlin.de/cow/frequencies/

ORIGIN AND ORIGINAL CORPUS SIZES

The ngrams are derived from the COW14AX sentence-shuffled corpora.

Information http://corporafromtheweb.org/category/corpora/

Interface https://webcorpora.org/

English 9,578,828,861 tokens (International)

German 11,660,894,000 tokens (AT, CH, DE)

Spanish 3,680,794,644 tokens (International)

Swedish 4,842,753,707 tokens (FI, SE)

FREQUENCY LISTS

Languages English, German, Spanish, Swedish

Versions Lemma, Lemma + POS, Word, Word + POS

Thresholds no threshold; raw frequency > 9

Measures raw frequency, absolute rank, frequency per million, log-frequency per million, frequency band
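As a rough illustration of how the per-million measures relate to raw frequency, here is a minimal Python sketch. It uses the English token count given under ORIGIN AND ORIGINAL CORPUS SIZES. The function names are made up for this example, and the logarithm base (10) is an assumption, since the base used in the released lists is not stated here. The frequency band measure is left out because its definition is not given in this announcement.

```python
import math

# Token count for the English COW14AX corpus, as listed in this announcement.
ENGLISH_TOKENS = 9_578_828_861


def per_million(raw_freq, corpus_tokens):
    """Normalize a raw frequency to occurrences per one million tokens."""
    return raw_freq * 1_000_000 / corpus_tokens


def log_per_million(raw_freq, corpus_tokens):
    """Log of the per-million frequency. Base 10 is an assumption."""
    return math.log10(per_million(raw_freq, corpus_tokens))


def passes_threshold(raw_freq, minimum=10):
    """The thresholded lists keep items with raw frequency > 9."""
    return raw_freq >= minimum


if __name__ == "__main__":
    raw = 50_000  # made-up raw frequency, for illustration only
    print(per_million(raw, ENGLISH_TOKENS))
    print(log_per_million(raw, ENGLISH_TOKENS))
    print(passes_threshold(raw))
```

Normalizing by corpus size is what makes frequencies comparable across the four languages, whose corpora differ in size by a factor of about three.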

NGRAMS

N 1 .. 5

Languages English, German, Spanish, Swedish

Versions Raw, Word, Word + POS, Lemma (except Swedish)
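The raw versions are meant for creating your own counted lists. The following Python sketch shows one way this could be done. Note that the line format is purely hypothetical: it assumes one ngram occurrence per line with four tab-separated fields (ngram, crawl year, top-level domain, country), which may not match the actual files, so check the downloaded data before adapting it. The function name and parameters are likewise invented for this example.

```python
from collections import Counter


def count_ngrams(lines, min_freq=1, tld=None):
    """Count ngram occurrences from raw lines.

    Assumed (hypothetical) line format, tab-separated:
        ngram <TAB> crawl_year <TAB> top_level_domain <TAB> country
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not fit the assumed format
        ngram, _year, line_tld, _country = fields
        if tld is not None and line_tld != tld:
            continue  # optional restriction to one top-level domain
        counts[ngram] += 1
    # Apply the frequency threshold only after all lines have been counted.
    return {ng: n for ng, n in counts.items() if n >= min_freq}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        "of the\t2014\tuk\tGB\n",
        "of the\t2013\tcom\tUS\n",
        "in a\t2014\tuk\tGB\n",
    ]
    print(count_ngrams(sample))
    print(count_ngrams(sample, min_freq=2))
    print(count_ngrams(sample, tld="uk"))
```

For files of this size the counts will not fit comfortably in memory on every machine; sorting the raw file externally and counting adjacent duplicates is a common alternative.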

DEPENDENCY BIGRAMS

Languages English (German soon, maybe Swedish)

Versions Raw, Word, Word + POS, Lemma, Lemma + POS

--------------------------------------------------------------------


