[Corpora-List] Challenging Languages for MT

Christian Chiarcos christian.chiarcos at web.de
Fri Jun 19 01:40:08 CEST 2015

Dear Sheila,

generally speaking, the most severe bottleneck for statistical machine translation persists for languages for which insufficient amounts of parallel training data are available, i.e., most less-resourced languages. Other than that, every change in word order is problematic, especially if it pertains to topicalization of non-subject arguments (German "Die Kuh melkt Maria", i.e., "It is the cow Maria is milking" = "Maria is milking the cow (but not XYZ)", but not "The cow is milking Maria"), hence the recent interest in semantic machine translation, which allows one to keep track of semantic roles during translation. Particularly problematic are languages for which word segmentation cannot easily be established, as words are the basis of most translation models, e.g., Thai and (to a lesser degree) Japanese. Character-based machine translation can almost approximate word-based machine translation models (cf. Neubig et al. 2012, ACL-2012), but AFAIK, it systematically beats them only for closely related language pairs for which transliteration represents a reasonable fall-back solution for translation (several experiments by Jörg Tiedemann and Preslav Nakov, easy to find in the ACL Anthology).
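To make the segmentation point concrete, here is a minimal sketch in Python. It assumes the naive whitespace tokenization that many word-based pipelines start from, and contrasts it with plain character-level units. The function names and the Japanese example sentence are illustrative assumptions of mine, not taken from any of the cited systems.

```python
def word_tokens(sentence):
    """Naive word segmentation: split on whitespace only."""
    return sentence.split()


def char_tokens(sentence):
    """Character-level units: every non-whitespace character is a token."""
    return [ch for ch in sentence if not ch.isspace()]


# Illustrative sentences (assumed examples, both roughly "Maria is milking the cow").
english = "Maria is milking the cow"
japanese = "マリアは牛の乳を搾っている"

# English: whitespace yields usable word units.
print(word_tokens(english))

# Japanese: no spaces, so the whole sentence collapses into a single "word".
print(word_tokens(japanese))

# Character units are always available, but they are much shorter than words.
print(char_tokens(japanese))
```

Without a dedicated segmenter, the word-based view treats the unsegmented sentence as one unit, which is useless for a word-level translation model; the character-based view sidesteps segmentation at the cost of much longer sequences.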

Best, Christian

On 18.06.2015, at 20:22, Sheila Castilho M de Sousa <castils3 at mail.dcu.ie> wrote:

> Dear All,
> I would appreciate it if you could point out any studies on the most/least challenging languages for MT. I need some references on what has generally been considered easy/difficult for MT so far.
> Thank you.
> Regards,
> Sheila Castilho
> PhD Candidate,
> ADAPT Centre
> School of Applied Language and Intercultural Studies,
> Centre for Translation and Textual Studies,
> Dublin City University