[Corpora-List] Google's translations
redpony at umd.edu
Sun Mar 14 23:27:43 CET 2010
>> 3. Another interesting experiment is to let Google translate the German
>> word "Ufer" (meaning "bank", but only in the waterside sense) into Czech.
>> This gives "banky", which means "bank", but only in its financial sense.
>> This can be explained by the observation that Google always uses English as
>> interlingua (Ufer --> bank --> banky). If you directly translate e.g.
>> Spanish to French you will get exactly the same result as when you first
>> translate Spanish into English, and then translate the English output into
>> French.
>> Obviously, even for Google it is too costly to generate and maintain 52 *
>> 51 = 2652 translation models for all the supported language pairs. Or is it
>> that they have found that X to English to Y always performs better than X to
>> Y because there is so much more data available between English and X or Y
>> than between X and Y?
> That is a fascinating observation. Conventional wisdom has it that going
> through a pivot language is a poor idea, but that does seem to be what is
> happening for French-Spanish.
> Doubly weird because one would hope that the close family relation between
> French and Spanish would be helpful.
Some translation results using pivot languages turn out to be quite
surprising (they were to me, at least). It turns out that the optimal
translation path between languages in a statistical system is probably
a function of characteristics of the training data available to train
the systems for individual language pairs. See, for example, Section
462 Machine Translation Systems for Europe, Philipp Koehn, Alexandra
Birch and Ralf Steinberger, MT Summit XII, 2009
Their statistical systems do better translating European legalese by
pivoting through English than using more direct routes, presumably
because the legalese training data was translated in this way (by
humans). In other words, while there is presumably some good "direct"
translation between closely related languages, it's not always
learnable by statistical systems from the available training data.
So, going through English may be a good idea, not just because it
means you have to build fewer systems.
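
To put rough numbers on the "fewer systems" point, and to show how a word sense can get lost in the pivot step, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function names are made up, the one-word "models" are toy lookup tables built from the Ufer example quoted above, and the pivot count assumes the pivot language is itself one of the supported languages. It says nothing about how Google's system actually works.

```python
def direct_systems(n_languages: int) -> int:
    """Directed language pairs when every pair gets its own model."""
    return n_languages * (n_languages - 1)


def pivot_systems(n_languages: int) -> int:
    """Models needed when everything is routed through one pivot language.

    Assumes the pivot is one of the n languages, so each of the other
    n - 1 languages needs one model into the pivot and one out of it.
    """
    return 2 * (n_languages - 1)


# Toy one-word "models" (hypothetical, only to show how a sense gets lost).
DE_TO_EN = {"Ufer": "bank"}   # waterside sense on the German side
EN_TO_CS = {"bank": "banky"}  # financial sense, as in the quoted example


def pivot_translate(word: str, src_to_pivot: dict, pivot_to_tgt: dict) -> str:
    """Compose two lookups; the pivot word carries no sense information."""
    return pivot_to_tgt[src_to_pivot[word]]


if __name__ == "__main__":
    print(direct_systems(52))  # 2652
    print(pivot_systems(52))   # 102
    print(pivot_translate("Ufer", DE_TO_EN, EN_TO_CS))  # banky
```

With 52 languages that is 2652 direct models against 102 pivot models. The toy lookup also shows why the second step cannot recover the waterside sense: all it ever sees is the English string "bank".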