[Corpora-List] WordNet ignores function words ...

John F Sowa sowa at bestweb.net
Thu May 4 19:14:02 CEST 2017

Ricardo and Albretch,

> Maybe the work by Wang Ling [et al...] can help:

Yes. That article adds "two simple modifications" to word2vec "in order to generate embeddings more suited to tasks involving syntax."

As the authors say, those methods are helpful:
> With these models we obtain improvements in two mainstream NLP tasks,
> namely part-of-speech tagging and dependency parsing...

But those are just the early stages of NLP. You need much more for semantics, pragmatics, and reasoning about the content.

> "large" corpora (collections of such texts), you can arrive at a
> full description of the grammar of a language in a deductive way...
> How "large" is "large" enough?

The Rosetta Stone had 14 lines of hieroglyphic text. That text plus the corresponding Greek was sufficient for *human* analysts to get started. Some of the glyphs were also used for their phonetic value. With the assumption that Coptic, as preserved in religious texts, was a descendant of ancient Egyptian, linguists were also able to reconstruct the phonemes of the ancient language.

> How large should a corpus be so that the deductive conclusions
> you mine out of it you can safely regard as valid?

That depends on what you mean by valid. Today's corpora have billions of words -- yet the currently popular machine learning methods can't understand language at the level of a child.

Today's ML methods do *perceptual* learning. That is very important, and it's sufficient for many applications.

But the linguists who deciphered ancient Egyptian did *cognitive* learning. An infant begins with perceptual learning, but quickly moves to cognitive learning. The fundamental distinction:

1. Perception answers the question "What do you see (or hear)?"
   It needs large amounts of data, and it can't explain the answers.

2. Cognition answers the question "Why did you do that?" It
   requires much less data, it can generate an explanation in
   ordinary language, and it can answer follow-up questions.

For the distinction between perceptual learning and cognitive learning, see slides 30 to 43 of http://www.jfsowa.com/talks/cogmem.pdf

