[Corpora-List] Word embeddings from large corpora in Sketch Engine [12 languages; more to come]

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Fri Feb 2 15:07:14 CET 2018


Dear all,

this is to announce the public availability of word embedding models computed from the large corpora that we have in Sketch Engine. So far, we have processed corpora for the following languages:

English, Arabic, Chinese, Czech, Danish, French, German, Italian, Korean, Portuguese, Russian, Spanish

See https://embeddings.sketchengine.co.uk/, where you can find an online interface for executing word similarity queries (such as the infamous king-man+woman) and download the datasets. They can be used freely for non-commercial purposes; for commercial use, do not hesitate to get back to me to work out a mutually suitable model of collaboration.
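
For readers unfamiliar with the king-man+woman query, here is a minimal sketch of the underlying vector arithmetic in plain Python. The toy three-dimensional vectors and the function names `cosine` and `analogy` are made up for illustration; this is not the code behind the online interface, and real models use hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(cosine(target, vec), w)
              for w, vec in vectors.items() if w not in exclude]
    scored.sort(reverse=True)
    return [w for _, w in scored[:topn]]

# Hypothetical toy vectors, chosen by hand only to illustrate the arithmetic.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

print(analogy(toy, positive=["king", "woman"], negative=["man"]))
```

With the downloaded datasets, `toy` would be replaced by vectors loaded from the model file; the file format is not described in this message, so loading is left out here.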

We continue building further models as our spare computing capacity allows, and will continue publishing them. If you are interested in a particular language that is missing at this moment, let me know and I can try to prioritise (no guarantees though).

The embeddings were calculated using FastText with various parameters and on various corpus attributes (word, lemma, lemma+PoS combination, lowercase, etc.).
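
To illustrate what training on different corpus attributes can mean in practice, here is a rough sketch that turns tokenised, annotated text into one training token per position for a chosen attribute. It assumes a tab-separated one-token-per-line input with columns word, PoS tag and lemma, and joins lemma and tag with an underscore; the column order, the separator and the function name `tokens_for_attribute` are assumptions for this example, not a description of how the published models were prepared.

```python
def tokens_for_attribute(vertical_lines, attribute):
    """Yield one training token per corpus position for the chosen attribute.

    Assumed input: one token per line, tab-separated as word, PoS tag, lemma.
    Lines starting with "<" are treated as structure markup and skipped
    (a simplification: a literal "<" token would be skipped as well).
    """
    for line in vertical_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        word, pos, lemma = line.split("\t")[:3]
        if attribute == "word":
            yield word
        elif attribute == "lowercase":
            yield word.lower()
        elif attribute == "lemma":
            yield lemma
        elif attribute == "lemma+pos":
            # Hypothetical joining convention for the lemma+PoS combination.
            yield f"{lemma}_{pos}"
        else:
            raise ValueError(f"unknown attribute: {attribute}")

# Made-up sample sentence in the assumed format.
sample = [
    "<s>",
    "The\tDT\tthe",
    "kings\tNNS\tking",
    "ruled\tVBD\trule",
    "</s>",
]

for attr in ("word", "lowercase", "lemma", "lemma+pos"):
    print(attr, "->", " ".join(tokens_for_attribute(sample, attr)))
```

Each resulting token stream, written to a plain text file, could then serve as FastText training input, for example via `fasttext skipgram -input corpus.txt -output model` (the file names are placeholders).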

We have had an increasing number of requests to obtain corpora from Sketch Engine for these purposes, so this is our response, intended to support research in this area.

Cheers,
Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk


