[Corpora-List] Word embeddings from large corpora in Sketch Engine [12 languages; more to come]

Ayah Zirikly aya.zerikly at gmail.com
Fri Feb 2 18:11:49 CET 2018

Hi Milos,

Thank you for providing the pretrained word vectors. I am specifically interested in the Arabic version, and I have a question regarding hamza handling. When I searched for أحمد [Ahmad, or >Hmd in Buckwalter], the results were empty, whereas احمد (without the hamza) returned results. Did you normalize all hamza-carrying alefs to a bare alef?
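To make the question concrete, here is a minimal sketch of the kind of normalization I have in mind. It is only my guess at what might have been applied, not a description of the actual Sketch Engine preprocessing; the helper name normalize_alef and the choice of which alef variants to fold are my own assumptions.

```python
# Hypothetical helper: folds hamza-carrying alef variants to a bare alef.
# This is an assumed normalization, not the confirmed Sketch Engine pipeline.
ALEF_VARIANTS = {
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0671": "\u0627",  # alef wasla -> bare alef
}
_ALEF_TABLE = str.maketrans(ALEF_VARIANTS)


def normalize_alef(word: str) -> str:
    """Return the word with every alef variant replaced by a bare alef."""
    return word.translate(_ALEF_TABLE)


# Normalize the query form before looking it up in the embedding vocabulary.
query = "\u0623\u062d\u0645\u062f"  # Ahmad written with hamza on the alef
print(normalize_alef(query) == "\u0627\u062d\u0645\u062f")
```

If the vectors were built on text normalized this way, applying the same mapping to a query before lookup should make both spellings hit the same entry.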

Thank you,


On Fri, Feb 2, 2018 at 9:07 AM, Miloš Jakubíček <milos.jakubicek at sketchengine.co.uk> wrote:

> Dear all,
> this is to announce the public availability of word embedding models calculated
> for the large corpora that we have in Sketch Engine. At this moment, we have
> processed corpora for the following languages:
> English, Arabic, Chinese, Czech, Danish, French, German, Italian, Korean,
> Portuguese, Russian, Spanish
> See https://embeddings.sketchengine.co.uk/ where you can find an online
> interface for executing word similarity queries (such as the infamous
> king-man+woman) and download the datasets. They can be used freely for
> non-commercial purposes; for commercial use, do not hesitate to get
> back to me to work out a mutually suitable model of collaboration.
> We continue building further models as our spare computing capacity
> allows, and will continue publishing them. If you are interested in a
> particular language that is missing at this moment, let me know and I can
> try to prioritise (no guarantees though).
> The embeddings were calculated using FastText with various parameters and
> on various corpus attributes (word, lemma, lemma+PoS combination, lowercase
> etc.)
> We have had an increasing number of requests to obtain corpora from Sketch
> Engine for these purposes, so this is our response, intended to support
> research in this area.
> Cheers,
> Milos Jakubicek
> CEO, Lexical Computing
> Brno, CZ | Brighton, UK
> http://www.lexicalcomputing.com
> http://www.sketchengine.co.uk
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> https://mailman.uib.no/listinfo/corpora