On 14 Mar 2010, at 22:27, Chris Dyer wrote:
>>> 3. Another interesting experiment is to let Google translate the German
>>> word "Ufer" (meaning "bank", but only in the waterside sense) into Czech.
>>> This gives "banky", which means "bank", but only in its financial sense.
>>> This can be explained by the observation that Google always uses English as
>>> interlingua (Ufer --> bank --> banky). If you directly translate e.g.
>>> Spanish to French you will get exactly the same result as when you first
>>> translate Spanish into English, and then translate the English output into
>>> French.
>>> Obviously, even for Google it is too costly to generate and maintain 52 *
>>> 51 = 2652 translation models for all the supported language pairs. Or is it
>>> that they have found that X to English to Y always performs better than X to
>>> Y because there is so much more data available between English and X or Y
>>> than between X and Y?
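
For what it's worth, the arithmetic is easy to check. The sketch below counts directed translation systems under the two designs, assuming 52 supported languages as quoted and a single pivot language. The function names are mine, purely for illustration, and say nothing about how Google actually organises its models.

```python
def direct_systems(n_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return n_languages * (n_languages - 1)


def pivot_systems(n_languages: int) -> int:
    """Systems needed if every pair is routed through one pivot language:
    one system into the pivot and one out of it per non-pivot language."""
    return 2 * (n_languages - 1)


n = 52  # number of supported languages quoted in the thread
print(direct_systems(n))  # 52 * 51 = 2652
print(pivot_systems(n))   # 2 * 51 = 102
```

So the direct design grows quadratically with the number of languages, while the pivot design grows linearly.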
>>
>> That is a fascinating observation. Conventional wisdom has it that going
>> through a pivot language is a poor idea, but that does seem to be what is
>> happening for French-Spanish.
>> Doubly weird because one would hope that the close family relation between
>> French and Spanish would be helpful.
>
> Some translation results using pivot languages turn out to be quite
> surprising (they were to me, at least). It turns out that the optimal
> translation path between languages in a statistical system is probably
> a function of characteristics of the training data available to train
> the systems for individual language pairs. See, for example, Section
> 6.1 in
>
> 462 Machine Translation Systems for Europe, Philipp Koehn, Alexandra
> Birch and Ralf Steinberger, MT Summit XII, 2009
> http://www.mt-archive.info/MTS-2009-Koehn-1.pdf
>
> Their statistical systems do better translating European legalese by
> pivoting through English than using more direct routes, presumably
> because the legalese training data was translated in this way (by
> humans). In other words, while there is presumably some good "direct"
> translation between closely related languages, it's not always
> learnable by statistical systems from the available training data.
> So, going through English may be a good idea, not just because it
> means you have to build fewer systems.
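
To make the "Ufer" example from earlier in the thread concrete, here is a toy sketch of one-best pivoting through English. The three lexicons are invented for illustration (in particular, the direct German-Czech entry "břeh" is my own assumption, not something reported in the thread). Real statistical systems pass scored hypotheses over whole sentences rather than single dictionary entries, so this only shows how a sense distinction can be lost once the pivot word is ambiguous.

```python
# Toy one-best lexicons, invented for illustration only.
DE_EN = {"Ufer": "bank"}    # waterside sense, but English "bank" is ambiguous
EN_CS = {"bank": "banky"}   # the financial sense wins in this toy table
DE_CS = {"Ufer": "břeh"}    # hypothetical direct entry (assumption)


def translate(word: str, lexicon: dict) -> str:
    """Look the word up; fall back to the word itself if it is unknown."""
    return lexicon.get(word, word)


def pivot_translate(word: str, src_to_pivot: dict, pivot_to_tgt: dict) -> str:
    """Translate source -> pivot -> target, keeping only the one-best output."""
    return translate(translate(word, src_to_pivot), pivot_to_tgt)


print(pivot_translate("Ufer", DE_EN, EN_CS))  # banky (sense lost at the pivot)
print(translate("Ufer", DE_CS))               # břeh (sense preserved)
```

The second step sees only the pivot word "bank", so whatever sense the source word carried is no longer available to it.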
>
> -Chris