[Corpora-List] Man bites dog
joregan at gmail.com
Mon Nov 21 13:01:59 CET 2011
On 21 November 2011 03:15, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
> In LILT 6 (http://elanguage.net/journals/index.php/lilt/issue/current),
> "Zipf's Law and l'Arbitraire du Signe," Martin Kay discusses statistical MT,
> and says (p.22):
> Notice that a language model would, and should, guarantee
> that the French “homme mord chien” would be translated into
> English as “dog bites man”, rather than “man bites dog”,
> which is what it really means.
> I once proposed this exact example (with Spanish rather than French) to a
> computational linguist who knew more about MT than I do. (People who know
> more about MT than I do are quite common. Ok, they're quite common among
> computational linguists :-).) That person suggested I needed to learn more
> about MT.
> It would be nice to find myself making the same mistake that Martin Kay
> made. It would be even nicer if it weren't a mistake.
> Is Kay's claim correct? The context is of course pure statistical MT, not
> hybrid rule/ statistical systems. Assume that the pair "homme mord chien"/
> "man bites dog" never occurs in the training data, but that the reverse does
> (or at least that "dog bites man" appears on the English side, presumably
> with some significant frequency).
That idea overlooks how statistical reordering works, and assumes a
'bag of words' based method; it also presumes that the bigrams 'man
bites' and 'bites dog' never occur. More importantly, it assumes that
'dog bites man' is a more frequent trigram in English (i.e., in the
target language model), which doesn't seem to be true. That makes
sense in hindsight, when you consider the idiomatic value of 'man
bites dog'.
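To make the point about n-gram counts concrete, here is a rough sketch of how a bigram language model would score the two word orders. Everything in it is an assumption for illustration: the toy corpus is invented (it is not real frequency data), the names are hypothetical, and add-one smoothing stands in for whatever smoothing a real SMT language model would use.

```python
from collections import Counter
import math

# Hypothetical toy corpus, invented purely for illustration.
# It is NOT real frequency data for English.
TOY_CORPUS = [
    "man bites dog",
    "man bites dog",
    "dog bites man",
    "the man bites the apple",
    "the dog bites the bone",
]


def train_bigram_counts(sentences):
    """Count bigrams and their left-hand histories, with sentence markers."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return histories, bigrams


def log_prob(sentence, histories, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = histories[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


histories, bigrams = train_bigram_counts(TOY_CORPUS)
# Words that can be predicted: every corpus word plus the end marker.
vocab = {w for s in TOY_CORPUS for w in s.split()} | {"</s>"}

score_mbd = log_prob("man bites dog", histories, bigrams, len(vocab))
score_dbm = log_prob("dog bites man", histories, bigrams, len(vocab))

print("man bites dog:", score_mbd)
print("dog bites man:", score_dbm)
```

With these made-up counts the model scores 'man bites dog' higher, simply because 'man bites' and 'bites dog' are better attested than the reverse. The model has no built-in preference for the 'plausible' reading; it only follows what the training data contains.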
It has a sort of metaphorical truth, regarding SMT's difficulties with
novelty, but it's not literally true - file it away with 'the meat is
rotten, but the vodka is good' :).
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you