[Corpora-List] Bilingual Dictionary from Comparable Corpora

Alberto Simões albie at alfarrabio.di.uminho.pt
Sat Oct 11 10:52:53 CEST 2014


Dear Javid,

For parallel corpora, NATools can handle corpora of that size (it works in chunks). However, it is not designed to handle comparable corpora :(
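
To illustrate the general idea of working in chunks with sparse counts instead of loading a full document-word or word-alignment matrix, here is a minimal sketch. It is not NATools code and not its algorithm. The function names, the chunk size, and the input format (an iterable of aligned token-list pairs) are all assumptions made for illustration, and plain co-occurrence frequency is only a crude stand-in for a real alignment model.

```python
from collections import Counter, defaultdict
from itertools import islice


def chunks(pairs, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(pairs)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def count_chunk(block):
    """Sparse co-occurrence counts for one chunk of aligned units."""
    partial = defaultdict(Counter)
    for src_tokens, tgt_tokens in block:
        tgt_set = set(tgt_tokens)
        for s in set(src_tokens):
            # Only pairs that actually co-occur get an entry.
            partial[s].update(tgt_set)
    return partial


def cooccurrence_counts(pairs, chunk_size=10000):
    """Process the input chunk by chunk and merge the partial tables.

    `pairs` (hypothetical format) is any iterable of
    (source_tokens, target_tokens) tuples, e.g. a generator that reads
    aligned units from disk, so the corpus is never fully in memory.
    """
    total = defaultdict(Counter)
    for block in chunks(pairs, chunk_size):
        for s, row in count_chunk(block).items():
            total[s].update(row)
    return total


def to_probabilities(counts, top_k=5):
    """Relative frequencies per source word, keeping the top_k targets."""
    table = {}
    for s, row in counts.items():
        row_total = sum(row.values())
        table[s] = [(t, c / row_total) for t, c in row.most_common(top_k)]
    return table
```

If even the merged table is too large, each partial table could be written to disk and merged later, or rare pairs could be pruned per chunk; both are design choices outside this sketch.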

On 11/10/14 09:34, javid dadashkarimi wrote:
> Hi everybody,
> Thank you so much for your useful suggestions.
> However, our corpora are almost 20 GB in size and we have a memory
> problem. We have 300K unique target words and 750K alignments,
> and we cannot load the document-word or word-alignment matrices into
> memory. How can I use the tools efficiently?
> Best,
> Javid
>
> On Thu, Oct 9, 2014 at 2:34 AM, Reinhard Rapp <reinhardrapp at gmx.de
> <mailto:reinhardrapp at gmx.de>> wrote:
>
> Dear all,
>
> I would like to point to the work done by Tomas Mikolov, Quoc V. Le,
> and Ilya Sutskever:
>
> http://arxiv.org/abs/1309.4168
>
> It seems that there is code available for this (see footnote 1 of
> the paper).
>
> There is also a popular science article on this approach:
>
> http://www.technologyreview.com/view/519581/how-google-converted-language-translation-into-a-problem-of-vector-space-mathematics/
>
> Together with Michael Zock I organized a shared task on
> multi-stimulus association at the COLING 2014 workshop on Cognitive
> Aspects of the Lexicon (CogALex-IV) and from this I know that
> systems using Mikolov et al.'s neural network-based language
> modelling approach perform extremely well in the monolingual case
> (see e.g. the first 4 papers in the workshop proceedings to be found
> at http://aclanthology.info/events/cogalex-2014#W14-47).
>
> Let me also mention that we (Pierre Zweigenbaum, Serge Sharoff, and
> myself) are currently serving as guest editors for a special issue
> of the Journal of Natural Language Engineering (JNLE) on the topic
> of "Machine Translation Using Comparable Corpora":
> http://comparable.limsi.fr/jnle-bucc2015/ (submissions welcome,
> deadline Dec. 1, 2014). If you are working in this field, but will
> not be able to submit a paper yourself, please let us know about
> your work (especially if it is not already mentioned in the
> introductory chapter of the volume "Building and Using Comparable
> Corpora", see Serge's previous e-mail in this thread) as we are
> preparing an overview article which aims to be as comprehensive as
> possible.
>
> Many thanks and kind regards,
>
> Reinhard
>
> -----Original Message-----
> From: inguna.skadina at lumii.lv
> Sent: Tuesday, October 7, 2014 8:48 AM
> To: IngunaSkadiņa
> Cc: corpora at uib.no ; gate-users-request at lists.sourceforge.net
> Subject: Re: [Corpora-List] Bilingual Dictionary from Comparable Corpora
>
> Dear Javid,
>
>
> The ACCURAT toolkit (http://accurat-project.eu/) can identify
> semi-parallel sentences in comparable corpora and extract a
> dictionary/translation table from them (with the support of GIZA++).
>
> I hope you will find it useful.
>
> Best wishes,
> Inguna Skadiņa
>
> Quoting javid dadashkarimi <javiddadashkarimi at gmail.com>:
>
> Hi,
> Is there any tool for extracting a probabilistic bilingual
> dictionary from a bilingual comparable corpus? Does Moses support
> such a task?
> Best,
> Javid
>
> _________________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora


