[Corpora-List] Last CFP: NAACL Workshop on Vector Space Modeling for NLP

John F Sowa sowa at bestweb.net
Thu Feb 26 20:32:02 CET 2015

On 2/24/2015 5:20 AM, Shay Cohen wrote:
> NLP started with methods based on pure symbolic analysis of language.
> Statistical methods were introduced to NLP in the 1990s,

Some historical points:

1. Statistical methods for content analysis were pioneered by
   Lasswell (1948) and Berelson (1952), and they were computerized
   as soon as computers became widely available. For references,
   see http://en.wikipedia.org/wiki/Content_analysis

2. Charles C. Fries pioneered the use of corpora in language
   analysis from the 1920s to the 1950s. For references, see

3. As early as 1947, Warren Weaver recognized the potential
   for computers in machine translation. He was instrumental
   in getting funding for it. He was also the coauthor with
   Claude Shannon of _The Mathematical Theory of Communication_
   (1949). That book stimulated a considerable body of research
   in the application of statistical methods to language analysis.

4. Chomsky's thesis adviser, Zellig Harris, pioneered transformational
   methods. Unlike Chomsky, Harris emphasized the use of corpora and
   statistics. See the collection, _The Legacy of Zellig Harris_:

5. Victor Yngve, a pioneer in MT, was also a pioneer in using
   statistics in language analysis. Hutchins summarizes both
   in http://aclweb.org/anthology/J/J12/J12-3001.pdf

6. As the director of the MT project at MIT, Yngve hired Chomsky as
   a promising young PhD whose syntactic methods might be useful.
   Chomsky also taught a course in linguistics and published his
   notes as _Syntactic Structures_ (1957). In that book, Chomsky
   strongly rejected statistical methods and the use of corpora.

7. In the 1980s, Fred Jelinek used statistical methods for a project
   on speech recognition at IBM Research. John Cocke suggested that
   similar methods might be useful for MT. In those days, such methods
   swamped the capacity of the largest IBM mainframes. By the 1990s,
   they could run on minicomputers and workstations.

