[Corpora-List] WordNet ignores function words ...

maxwell maxwell at umiacs.umd.edu
Thu May 4 22:59:58 CEST 2017


On 2017-05-04 13:14, John F Sowa wrote:
> Albretch
>> How large should a corpus be so that the deductive conclusions
>> you mine out of it you can safely regard as valid?
>
> That depends on what you mean by valid. Today's corpora have
> billions of words -- yet the currently popular machine learning
> methods can't understand language at the level of a child.
>
> Today's ML methods do *perceptual* learning. That is very
> important, and it's sufficient for many applications.
>
> But the linguists who deciphered ancient Egyptian did *cognitive*
> learning. An infant begins with perceptual learning, but quickly
> moves to cognitive learning. Fundamental distinction:
>
> 1. Perception answers the question "What do you see (or hear)?"
> It needs large amounts of data, and it can't explain the answers.
>
> 2. Cognition answers the question "Why did you do that?" It
> requires much less data, it can generate an explanation in
> ordinary language, and it can answer follow-up questions.

As someone who got into computational linguistics in the days when you had to be a linguist to be a computational linguist, I've often thought about a "showdown" between a linguist and a machine learning system. Give both a smallish parallel corpus (the New Testament of the Bible, for example) in some language that neither "knows", plus (relatively) unlimited time, and ask each to come up with a parser (or just a morphological parser) for that language. (Since most linguists don't build parsers, the linguist could get help from a programmer.) At the end, compare the two on some held-out data. My suspicion is that the human linguist would do well, maybe even better than the machine learning system. But as I say, I'm biased; I'm a linguist at heart.

The limitation to a smallish corpus of course helps the linguist, who probably can't make efficient use of a large corpus. But it is also realistic, in the sense that for many, many languages of the world, that's all you have, and for some, all you ever will have.

Mike Maxwell

University of Maryland


