[Corpora-List] Word embeddings from large corpora in Sketch Engine [12 languages; more to come]

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Fri Feb 2 22:07:05 CET 2018

Dear Ayah,

I asked my colleagues and apparently yes, the tagger removes all diacritics.
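
A query can therefore be normalized the same way before lookup. The following is a minimal sketch, assuming the normalization amounts to stripping combining marks and mapping the hamza-carrying alef forms to bare alef. The tagger's exact rules are not stated here, so the mapping table and the name normalize_query are illustrative only; hamza on waw or ya (ؤ, ئ) is deliberately left untouched.

```python
import unicodedata

# Assumed mapping (not confirmed against the tagger): alef variants -> bare alef.
ALEF_VARIANTS = {
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda above
    "\u0671": "\u0627",  # alef wasla
}


def normalize_query(word: str) -> str:
    """Hypothetical helper: approximate the tagger's normalization for lookups."""
    # Drop combining marks (short vowels, shadda, sukun, combining hamza, ...).
    stripped = "".join(ch for ch in word if not unicodedata.combining(ch))
    # Map the precomposed alef variants to bare alef.
    return "".join(ALEF_VARIANTS.get(ch, ch) for ch in stripped)


if __name__ == "__main__":
    # Alef-with-hamza spelling of "Ahmad", normalized to the bare-alef spelling.
    print(normalize_query("\u0623\u062d\u0645\u062f"))
```

With this, the hamza spelling of أحمد is turned into احمد, the form that returned results in the online interface.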

Best,
Milos

Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk

On 2 February 2018 at 18:11, Ayah Zirikly <aya.zerikly at gmail.com> wrote:

> Hi Milos,
> Thank you for providing the pretrained word vectors. I am specifically
> interested in the Arabic version.
> I have a question regarding hamza handling: I noticed that searching for
> أحمد [Ahmad, or >Hmd in Buckwalter] returned no results, whereas
> searching for احمد without hamza did. Did you normalize all hamza forms to
> regular alef?
> Thank you,
> Ayah
> On Fri, Feb 2, 2018 at 9:07 AM, Miloš Jakubíček <
> milos.jakubicek at sketchengine.co.uk> wrote:
>> Dear all,
>> this is to announce the public availability of word embedding models
>> calculated for the large corpora that we have in Sketch Engine. At this
>> moment, we have processed corpora for the following languages:
>> English, Arabic, Chinese, Czech, Danish, French, German, Italian, Korean,
>> Portuguese, Russian, Spanish
>> See https://embeddings.sketchengine.co.uk/ where you can find an online
>> interface for executing word similarity queries (such as the infamous
>> king-man+woman) and download the datasets. They can be used freely for
>> non-commercial purposes; for commercial use, do not hesitate to get
>> back to me to work out a mutually suitable model of collaboration.
>> We continue building further models as our spare computing capacity
>> allows, and will continue publishing them. If you are interested in a
>> particular language that is missing at this moment, let me know and I can
>> try to prioritise (no guarantees though).
>> The embeddings were calculated using FastText with various parameters and
>> on various corpus attributes (word, lemma, lemma+PoS combination, lowercase
>> etc.)
>> We have had an increasing number of requests to obtain corpora from Sketch
>> Engine for these purposes, so this is our response, intended to support
>> research in this area.
>> Cheers,
>> Milos Jakubicek
>> CEO, Lexical Computing
>> Brno, CZ | Brighton, UK
>> http://www.lexicalcomputing.com
>> http://www.sketchengine.co.uk
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> https://mailman.uib.no/listinfo/corpora