[Corpora-List] WordNet ignores function words ...

Albretch Mueller lbrtchx at gmail.com
Thu May 4 13:09:22 CEST 2017

On 5/4/17, John F Sowa <sowa at bestweb.net> wrote:
> On 5/3/2017 5:02 PM, Albretch Mueller wrote:
>> Do you/does anyone in our list know of ways to associate "links"
>> (what they call "stop words") to syntax
> Yes. It's called parsing. Any method that throws away stop words
> and inflections (e.g. word vectors) throws away all the structure
> in language. No such method can do language understanding.

But if I understand correctly what parse trees are about, they are created by the top-down process of analyzing sentences and phrases. What I meant is that by starting from an axiomatically bare definition of what a text is:

1) a sequence of (elements interpreted as) signs

2) starting with a semantic end

3) ending with a semantic start

4) consciously used for communicating

5) by social parties

and "large" corpora (collections of such texts), you can arrive at a full description of the grammar of a language in a deductive way.

That "large" is rather shaky. How large is "large enough", that is, "representative" for a given purpose (describing the grammar of a language among them)? In fact, I haven't come across any attempt to define, in an at least somewhat measurable way, how "representative" a given corpus is. Is that even a valid question, or an artifact of my promiscuous mind? Could you determine with syntactic methods (mathematical equations, algorithms, axiomatic theories with some logical closure) what "large enough" means when it comes to corpora?
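
One purely counting-based handle on "large enough" is the Good-Turing estimate of the unseen probability mass: the number of word types seen exactly once, divided by the number of tokens, approximates the chance that the next token is a type the corpus has never shown. The sketch below is only an illustration under loud assumptions: whitespace tokenization, a toy corpus, and hypothetical function names (missing_mass, saturation_curve). It measures lexical saturation only and says nothing by itself about grammar or "representativeness".

```python
from collections import Counter

def missing_mass(tokens):
    """Good-Turing estimate of the probability that the next token is an unseen type.

    Computed as (number of types seen exactly once) / (number of tokens).
    """
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    if n_tokens == 0:
        return 1.0  # nothing observed yet, so everything is unseen
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / n_tokens

def saturation_curve(tokens, steps=4):
    """Missing-mass estimates on growing prefixes of the corpus."""
    size = len(tokens)
    curve = []
    for i in range(1, steps + 1):
        n = size * i // steps
        curve.append((n, missing_mass(tokens[:n])))
    return curve

# Toy corpus (an assumption for illustration, not real data).
toy_corpus = "the cat sat on the mat and the dog sat on the log".split()

for n, mass in saturation_curve(toy_corpus):
    print(n, round(mass, 3))
```

If the estimate stops falling as the corpus grows, adding more text of the same kind is no longer buying much new vocabulary; whether that is "enough" still depends on the question being asked.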

The last time I tried to discuss such questions of "Sachlichkeit" (objectivity) in corpus research, I got answers such as "a corpus is what the client wants" (and "the client is always right" ;-)).

This matters especially now that we are living in (hopefully through) our nonsensical AI and big-data era, in which some types of folks think (IMO very wrongly) that they can run societies/the world based on some kind of NSA/Five Eyes Überkorpus. Those NSA types apparently believe in their "collect it ... all, all, all ..." mantra. I think their problem is an existential, purely pragmatic one: they need to "collect it all" because they have no brains. (In fact, only people without functioning brains, which they need to hide deep in secrecy, military badges ..., would believe such an AI Überkorpus b#llsh!t possible; that is why they need to "collect it all": they can't think.)

How do we know that we have a "large enough" corpus to find answers to specific questions? Or, to put the question from another angle: how do we know if/when a given corpus is failing us? At which point has it outlived its usefulness, as invariably happens with all semiotic systems, so that the corpus we are using as a tool starts playing us and we start "seeing" in it what we need/want to see?

At times I have tried to drive that idea out of my mind because it feels quite a bit like an Erlangen Program/logical systematization of Mathematics sort of nonsense, but I keep running into the same question from different directions.

How large should a corpus be so that you can safely regard the deductive conclusions you mine out of it as valid? As linguists, do you think that, say, all the texts from the gutenberg.org project are enough to deductively describe the English language inside out? Would all the en.wikipedia.org/wiki/ ones suffice? Both? Some more? ...
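
A crude way to probe whether one collection "suffices" for another is to ask how much of the second collection's running text consists of word types already attested in the first. The sketch below assumes whitespace tokenization and uses two made-up one-line stand-ins for the two collections; the name token_coverage is hypothetical. It checks lexical overlap only, so high coverage would be a necessary hint at best, not evidence that a grammar can be described from the reference corpus.

```python
def token_coverage(reference_tokens, probe_tokens):
    """Fraction of probe tokens whose type also occurs in the reference corpus."""
    if not probe_tokens:
        return 1.0  # an empty probe is trivially covered
    known = set(reference_tokens)
    hits = sum(1 for t in probe_tokens if t in known)
    return hits / len(probe_tokens)

# Toy stand-ins (assumptions for illustration, not real corpus samples).
corpus_a = "the whale swam in the sea".split()
corpus_b = "the article was edited in the wiki".split()

print(token_coverage(corpus_a, corpus_b))  # how well A covers B
print(token_coverage(corpus_b, corpus_a))  # how well B covers A
```

In this toy case the only shared types are the function words "the" and "in"; the measure is asymmetric, so both directions are worth looking at.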

