[Corpora-List] Bootcamp: 'Quantitative Corpus Linguistics with R'-- re Louw's endorsement

Stefan Th. Gries stgries at gmail.com
Thu Aug 14 20:50:46 CEST 2008


Now that matters of substance are being discussed, let me chime in again (this time only on my behalf). There are two issues I am concerned with here, one having to do with software, the other having to do with theoretical orientation.

As to the former, it is interesting that the use of a particular software tool appears to be in part responsible for so many concerns. Let me quote a part of Wolfgang's posting:
> For R-software, it does no matter what kind of strings of information
> bit are processed. It could be language, but it could also be DNA
> sequences or the ciphers behind the "3." in the number pi. To me it
> seems that much of what will be presented at the camp is relatively
> application-free. Language is just one of many possible applications.

Well, last time I checked that is true of any concordancing software or of any scripting language: any concordancer (most notably those that can handle Unicode) can process any sequence of strings so I fail to see in what way this particular characteristic makes R special. A more general but just as correct version of Wolfgang's paragraph is therefore this:
> For Perl, Python, R, and in fact any other concordancer, it does no
> matter what kind of strings of information bit are processed.
> It could be language, but it could also be DNA sequences or the
> ciphers behind the "3." in the number pi. To me it seems that much
> of what will be presented at the camp is relatively application-free.
> Language is just one of many possible applications.

This raises two interesting questions, the first tongue-in-cheek, the other more substantive: (i) If what we offered had been a Bootcamp 'Corpus Linguistics with AntConc' - would that have raised less resistance? ;-) (ii) What, then, is special about R? More specifically, (iia) what is special about R compared to concordancing software, and (iib) what is special about R compared to other programming languages?

As for (iia), R is different in the sense that it is much more powerful than any concordancer can be (which is no critique of these tools; after all, I link to all I know on my own website).

(a) No concordance tool can do this (since Wolfgang mentioned morphology):
- download Adam Kilgarriff's BNC frequency list from the web;
- retrieve from it all words tagged as adjectives and their frequencies as well as the number of files in which they occur;
- contrast the frequencies and the number of files they occur in of the adjectives ending in -ic with those ending in -ical in a graph and with a statistical test;
- contrast the frequencies of the nouns followed by adjectives ending in -ic with those ending in -ical in a graph and with a statistical test.

(This issue was first raised by Marchand, and -ic/-ical adjectives were investigated in several studies, most recently by Mark Kaunisto (e.g., in /English Studies/) and myself (in /ICAME Journal/ and /International Journal of Corpus Linguistics/); whoever looks at these studies will find meaning is discussed a lot in them.)
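
To make the second and third steps of (a) a bit more concrete, here is a minimal sketch in R. It assumes the frequency list has already been read into a data frame; the words, the tag, and every number below are invented for illustration and are not taken from Kilgarriff's list, and the Wilcoxon test is just one plausible choice of test.

```r
# Toy frequency list: all values are invented, NOT real BNC figures
freqs <- data.frame(
  word  = c("economic", "economical", "historic", "historical",
            "electric", "electrical", "classic", "classical"),
  tag   = "aj0",
  freq  = c(2300, 150, 420, 1900, 610, 580, 350, 700),
  files = c(900, 120, 300, 850, 400, 390, 260, 480),
  stringsAsFactors = FALSE
)

# Classify each adjective by its suffix (every toy word ends in -ic or -ical)
freqs$suffix <- ifelse(grepl("ical$", freqs$word), "ical", "ic")

# Median frequency and median number of files per suffix
by_suffix <- aggregate(cbind(freq, files) ~ suffix, data = freqs, FUN = median)
print(by_suffix)

# Non-parametric comparison of the two frequency distributions
test <- wilcox.test(freq ~ suffix, data = freqs)
print(test$p.value)

# Graphical contrast (only drawn in an interactive session)
if (interactive()) {
  boxplot(log(freq) ~ suffix, data = freqs, ylab = "log frequency")
}
```

With a real list, the data frame would come from something like read.table() on the downloaded file, and the same three steps would apply unchanged.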

(b) No ready-made concordance tool can do this (since Wolfgang mentioned language acquisition):
- load all the files for one child from the language acquisition corpus database CHILDES;
- clean them up in terms of line breaks etc.;
- either generate concordances on them; or
- transform all the data for one child into an Excel-readable table to perform searches on many different levels of annotation at the same time (e.g., find the use of sleep but only when it's used as a verb in a loud voice).

(I know no line-based concordancer that can handle this kind of multi-tiered annotation.)
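
The table-based search in (b) could be sketched roughly as follows. The transcript lines are invented and only loosely modelled on the CHAT format; in particular, using a "%par" tier with the value "loud" to mark a loud voice is an assumption made for this example, as is the helper name get_tier.

```r
# Invented lines loosely modelled on CHAT transcripts; not real CHILDES data
chat <- c(
  "*CHI:\tI sleep now .",
  "%mor:\tpro|I v|sleep adv|now .",
  "%par:\tloud",
  "*MOT:\tdid you have a good sleep ?",
  "%mor:\tv|do pro|you v|have det|a adj|good n|sleep ?",
  "*CHI:\tteddy sleep too .",
  "%mor:\tn|teddy v|sleep adv|too .",
  "%par:\twhisper"
)

# Each main tier line (starting with "*") opens a new utterance
is_main <- grepl("^\\*", chat)
utt_id  <- cumsum(is_main)

# Split every line into its tier label and its content
tier    <- sub("^[*%]([^:]+):\t.*$", "\\1", chat)
content <- sub("^[^\t]*\t", "", chat)

# Hypothetical helper: content of one dependent tier per utterance (NA if absent)
get_tier <- function(name) {
  out <- rep(NA_character_, max(utt_id))
  hit <- !is_main & tier == name
  out[utt_id[hit]] <- content[hit]
  out
}

# One row per utterance, one column per annotation level
utts <- data.frame(
  speaker   = tier[is_main],
  utterance = content[is_main],
  mor       = get_tier("mor"),
  par       = get_tier("par"),
  stringsAsFactors = FALSE
)

# "sleep" as a verb, uttered by the child, in a loud voice
hits <- subset(utts, speaker == "CHI" &
                     grepl("(^| )v\\|sleep( |$)", mor) &
                     grepl("loud", par))
print(hits)

# write.csv(utts, "child.csv", row.names = FALSE)  # spreadsheet-readable export
```

The point of the one-row-per-utterance layout is that a search can combine conditions on several tiers at once, which a line-by-line search cannot do.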

(c) No ready-made concordance tool can do this to compile web corpora:
- send a search word to Google and collect up to 1,000 filetype-specific links regarding that search word;
- download all the files to which Google linked onto the hard drive;
- harvest all the links on these files; and
- crawl the web along these links to download all the documents that are not further than three links away from the original link.
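
The last two steps of (c), harvesting links and following them up to three links away, can be illustrated with a small sketch. To keep it self-contained, it crawls an invented in-memory "web" rather than real pages: the URLs, the pages list, and the function names fetch, extract_links, and crawl are all assumptions for this example, and the Google query step is left out entirely.

```r
# A made-up miniature web: URL -> HTML source. In real use, fetch() would
# download the page instead, e.g. with readLines(url(...)).
pages <- list(
  "http://a.example/" = '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
  "http://b.example/" = '<a href="http://d.example/">d</a>',
  "http://c.example/" = '',
  "http://d.example/" = '<a href="http://e.example/">e</a>',
  "http://e.example/" = '<a href="http://f.example/">f</a>',
  "http://f.example/" = ''
)
fetch <- function(url) if (is.null(pages[[url]])) "" else pages[[url]]

# Harvest all href targets from one HTML string
extract_links <- function(html) {
  m <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  gsub('^href="|"$', "", m)
}

# Breadth-first crawl: returns the link distance of every page reached,
# without following links from pages that are already max_depth links away
crawl <- function(seed, max_depth = 3) {
  depth <- 0
  names(depth) <- seed
  queue <- seed
  while (length(queue) > 0) {
    url   <- queue[1]
    queue <- queue[-1]
    if (depth[[url]] >= max_depth) next
    for (link in extract_links(fetch(url))) {
      if (!link %in% names(depth)) {
        depth[link] <- depth[[url]] + 1
        queue <- c(queue, link)
      }
    }
  }
  depth
}

reached <- crawl("http://a.example/")
print(reached)
```

A regular expression is a crude way to find links and would miss many real-world cases; it is used here only to keep the sketch free of add-on packages.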

As for (iib), of course other scripting languages can do these things, too. However, as someone who also uses Perl (and would like to learn more Python if he had the time), what is particularly appealing about R is that, from my personal experience,
- it is much simpler than, say, Perl because (this is now for co-geeks ;-)) it has several high-level functions that do things for which you need to write subroutines in Perl, and it is optimized for handling vectors. Once you load a corpus into a vector, you just write "sort(table(corpus))" and have a sorted frequency list - you don't need to loop over an array and dump stuff into a to-be-sorted hash.
- as I will mention in my masterclass in Granada, R can provide virtually all the technical functionality corpus linguists need *in a single environment* (frequency lists for those who don't get duped by them, concordances, collocations, dispersion plots, lexical frequency profiles, gravity, concgrams, Unicode and XML file handling, interaction with MySQL databases, you name it) but at the same time, it can perform all the statistical tests you have seen in corpus-linguistic research (Biber's factor analyses for register variation, Leech et al.'s loglinear analysis on genitives, Geisler's logistic regression, again you name it), and it has powerful graphical capabilities; for example, the table on my website linking to the bootcamp page has the same structure as all tables, but (i) the sizes in which the numbers are plotted reflect the size of the residuals (i.e., bigger observed numbers deviate more from the expected frequencies than smaller numbers, where bigger and smaller are to be understood in terms of plotting size), and (ii) the coloring indicates how the observed frequencies deviate from the expected ones: blue and red indicate that the observed frequencies are larger and smaller than the expected ones, respectively, which immediately gives all the structure in the data away.
To sum up, you don't need to use one or more concordancers to get the corpus data you want, then use Excel to get them into shape for a statistical analysis with SPSS, then put the SPSS results into whatever to create your graphs, ... you do it all in one and the same environment. Yes, there is a learning curve, but (i) it's only one learning curve because it's only one software, and (ii) every software has a learning curve.
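
For those who want to see the frequency-list one-liner and the residuals idea in action, here is a small sketch. The sentence and the 2x2 table of counts are invented, and the labels "form" and "register" are placeholders; the sketch only prints the residuals and does not reproduce the plot on my website.

```r
# Toy corpus, invented for illustration
text   <- "the cat sat on the mat and the dog sat too"
corpus <- strsplit(text, " ")[[1]]

# The one-liner from above; decreasing = TRUE puts the most frequent word first
freq_list <- sort(table(corpus), decreasing = TRUE)
print(freq_list)

# Made-up 2x2 table of observed frequencies
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(form = c("A", "B"),
                              register = c("spoken", "written")))

# Pearson residuals: (observed - expected) / sqrt(expected)
# Positive values mean more observations than expected, negative values fewer
res <- chisq.test(tab, correct = FALSE)$residuals
print(round(res, 2))
```

The sign of each residual corresponds to the coloring and its absolute size to the plotting size described above.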


> Dagmar S. Divjak's and Stefan Gries' boot camp is, as I see it, not about discussing corpus linguistics
That is correct, it's about *doing* corpus linguistics.


> To me it seems that much of what will be presented at the camp is relatively application-free.
That is incorrect: if the above examples are not corpus-linguistic applications, then I do not know what a corpus-linguistic application is. (This does not mean we're gonna do exactly these things, which will depend on the participants' ideas, too.)

Let me now turn to the more theoretical implications of Wolfgang's posting. I will begin with a few necessary paraphrases.


> The journal he co-edits bears the name Corpus Linguistics and Linguistic Theory. The only language theory that Gries accepts is cognitive linguistics.
This may be a bit of nit-picking, but let me change that to what I think is a more correct characterization of my theoretical beliefs: "The theoretical approach that Gries is most associated with is that of cognitively-inspired approaches."


> Meaning, for Gries, is a theoretical and therefore a cognitive concept. It plays no role in his version of corpus linguistics.
I actually believe something else: "Meaning, for Gries, is a cognitive entity and he thinks it is useful to examine it not in a theoretical vacuum but from a cognitively-inspired perspective." It is unclear how one can read Stefanowitsch's and my collostructional stuff or my papers on polysemy and near synonymy (the latter with the co-organizer of this bootcamp) and say meaning plays no role - last time I checked, polysemy, synonymy, and constructional semantics were issues of meaning. Are they not?


> Old-fashioned corpus linguists like myself have to accept that the label corpus linguistics has, over the last decade, been hijacked by theoretical linguists of all feathers.
Again a paraphrase that does away with the negative semantic prosody (an important concept in corpus linguistics! :-) ): "Corpus linguists like myself are glad to see corpus linguistic methods are now applied by (theoretical) linguists of all feathers."


> Its role is to provide empirical data that will then be interpreted from the theoretical platform of cognitive linguistics.
I wonder whether Harald Baayen, Tom Wasow, John Hawkins, Joan Bresnan, Marianne Hundt, Christian Mair and a zillion others who undoubtedly make descriptively AND theoretically relevant observations would consider themselves cognitive linguists. Yes, scholars such as Doris Schoenefeld, Michael Barlow, Suzanne Kemmer, and myself have argued for a greater interaction between corpus linguistics and cognitively-inspired approaches, but singling out cognitive linguistics as the only platform is an overly narrow perspective of the range of theories to which corpus linguistics can contribute.


> Cognitive linguistics tells Stefan Gries what a morpheme, a word, a phrase or a pattern is.
It does??? I don't think so ... And I thought each of us has an idea of what a morpheme, a word, a phrase or a pattern is. It's been a while since I came across a corpus-linguistic paper which started out by questioning what a morpheme is.


> This, then, is his input into the toolbox that he and many others now call corpus linguistics.
Well, a concordancer or a scripting language requires some input as to what to search for, doesn't it? In a recent paper, Louw provides a concordance display of "all sorts of". Surely he could only get these data by entering "all sorts of" into a software tool. (I think "all sorts of" could be called, let's say, a "unit of meaning".) Thus, this is not *my* input: any corpus linguist who doesn't simply read the whole corpus inputs something into a tool.

Let me thank Wolfgang for his thoughts, and I would like to end this treatise (thanks to those who bore with me so long) with a quote from his website with which I wholeheartedly agree:

"The word is not privileged in terms of meaning. [exactly the claim of cognitive linguists!] The corpus linguist posits endocentric entities, formally held together by some local grammar, and calls these entities (complex) lexical items or, alternatively, units of meaning. Lexical items can be single words, compounds, multi-word units, phrases, and even idioms. Just like single words, (complex) lexical items tend to recur in a discourse. This is why statistical procedures [!!] can be used for detecting them in a reasonably large corpus, as significant [!!] co-occurrences of the same entities." (<http://www.english.bham.ac.uk/who/myversion.htm>, accessed 5 seconds ago)

STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------


