[Corpora-List] Lexical bundles - and meaningful items...

Chris Butler csblists at telefonica.net
Fri Jul 8 11:08:00 CEST 2005

Dear John and other list members,

Ute Römer said:

"But I suppose that concordances of frequent
3-grams may still lead you to some interesting (and meaningful) 4- and
5-word items."

For lists of 3-word strings as well as longer ones, derived from English
corpora, you might like to look at the following, if you haven't already
done so:

Stubbs, Michael and Isabel Barth (2003) "Using recurrent phrases as text
type discriminators: a quantitative method and some findings." Functions of
Language 10(1): 61-104.

For similar data from Spanish, derived from smaller corpora (some as small
as 125000 words, none bigger than 1 million words), see

Butler, Christopher S. (1997) "Repeated word combinations in spoken and
written text: some implications for Functional Grammar." In C. S. Butler, J.
H. Connolly, R. A. Gatward and R. M. Vismans (eds.) A Fund of Ideas: Recent
Developments in Functional Grammar. Amsterdam: Institute for Functional
Research into Language and Language Use (IFOTT).

[As this is in a rather obscure publication which may be difficult for
people to get hold of, I could send an electronic version to anyone who is
interested.]

Also, Bengt Altenberg says in the following paper that most of the recurrent
sequences he isolated from the London-Lund Corpus were pretty short, with an
average of 3.15 words, and he gives a lot of examples of phraseologically
interesting 3-word sequences:

Altenberg, Bengt (1998) "On the phraseology of spoken English: the evidence
of recurrent word combinations." In A. P. Cowie (ed.) Phraseology: Theory,
Analysis, and Applications. Oxford: Clarendon Press.

Chris Butler


Ute Römer
English Department
University of Hanover
Königsworther Platz 1
30167 Hannover

Phone: +49 (0)511 762 2997
Fax: +49 (0)511 762 2996
E-mail: ute.roemer at anglistik.uni-hannover.de

> -----Original Message-----
> From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Jenny Eagleton
> Sent: Monday, July 04, 2005 4:46 AM
> To: corpora at uib.no
> Subject: [Corpora-List] Lexical bundles
>
> I notice that all of the studies I have read on this topic have
> focussed on 4 word bundles and that they have all used what I
> would call large corpora, i.e. many millions of words. The rationale
> seems to be that with 5 word bundles you do not get enough to analyse
> and that with three word bundles there are probably too many to
> handle.
>
> I want to do a study of bundles on a specific corpus I have, but
> which only has 600,000 words. To be able to work with large numbers
> of bundles, it would therefore make sense to focus on 3 word bundles.
> I could do a study on 4 word bundles, but the sample would be smaller.
>
> So my question is, do people see any disadvantages in focusing on
> 3-word bundles and, if so, what they might be?
>
> Looking forward to hearing your responses.



