[Corpora-List] Google's translations

Chris Dyer redpony at umd.edu
Sun Mar 14 23:27:43 CET 2010



>> 3. Another interesting experiment is to let Google translate the German
>> word "Ufer" (meaning "bank", but only in the waterside sense) into Czech.
>> This gives "banky", which means "bank", but only in its financial sense.
>> This can be explained by the observation that Google always uses English as
>> interlingua (Ufer --> bank --> banky). If you directly translate e.g.
>> Spanish to French you will get exactly the same result as when you first
>> translate Spanish into English, and then translate the English output into
>> French.
>> Obviously, even for Google it is too costly to generate and maintain 52 *
>> 51 = 2652 translation models for all the supported language pairs. Or is it
>> that they have found that X to English to Y always performs better than X to
>> Y because there is so much more data available between English and X or Y
>> than between X and Y?
>
> That is a fascinating observation. Conventional wisdom has it that going
> through a pivot language is a poor idea, but that does seem to be what is
> happening for French-Spanish. Doubly weird because one would hope that the
> close family relation between French and Spanish would be helpful.

Some translation results using pivot languages are quite surprising (they were to me, at least). The optimal translation path between two languages in a statistical system probably depends on the characteristics of the training data available for the individual language pairs. See, for example, Section 6.1 in

"462 Machine Translation Systems for Europe", Philipp Koehn, Alexandra Birch and Ralf Steinberger, MT Summit XII, 2009 http://www.mt-archive.info/MTS-2009-Koehn-1.pdf

Their statistical systems translate European legalese better by pivoting through English than by using more direct routes, presumably because the legalese training data was itself translated this way (by humans). In other words, while there is presumably some good "direct" translation between closely related languages, it is not always learnable by statistical systems from the available training data. So going through English may be a good idea, and not only because it means you have to build fewer systems.
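
To make the two points in this thread concrete, here is a rough Python sketch. It counts how many systems are needed with and without a pivot for the 52 languages mentioned in the quoted message, and it shows how a word sense can get lost when two one-best translations are chained. The function names and the one-word lexicons are hypothetical, made up for illustration from the "Ufer" example above; they are not taken from any real translation system.

```python
# Toy illustration only: the lexicons below are hand-made assumptions,
# not data from any real translation system.

def direct_systems(n_languages: int) -> int:
    """Directed source->target systems needed to cover every language pair."""
    return n_languages * (n_languages - 1)


def pivot_systems(n_languages: int) -> int:
    """Systems needed if every pair is routed through one pivot language
    that is itself one of the n languages (X->pivot and pivot->Y)."""
    return 2 * (n_languages - 1)


# Hypothetical one-best lexicons: each source word maps to a single output.
DE_TO_EN = {"Ufer": "bank"}    # waterside sense in German
EN_TO_CS = {"bank": "banky"}   # financial sense in Czech


def translate_via_pivot(word: str, src_to_pivot: dict, pivot_to_tgt: dict) -> str:
    """Chain two one-best lookups; the second step sees only the pivot word,
    so any sense distinction the pivot word does not carry is lost."""
    pivot_word = src_to_pivot[word]
    return pivot_to_tgt[pivot_word]


print(direct_systems(52))   # 2652
print(pivot_systems(52))    # 102
print(translate_via_pivot("Ufer", DE_TO_EN, EN_TO_CS))  # banky
```

The sketch only covers system counts and sense loss. It says nothing about translation quality, which, as noted above, seems to depend on the training data available for each pair.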

-Chris


