Multi-word units in the BNC

Leech, Geoffrey g.leech at
Tue Jul 5 12:22:00 CEST 2005


Just to point out that the BNC list suggested by A.H. interprets "multi-word" in a rather special sense, which approximates to the notion of "grammatical idiom" - these are cases where a sequence of orthographic words, from the point of view of POS-tagging, needs to be treated as a single "word". Many of them are foreign/classical expressions like "in flagrante delicto" and "hors d'oeuvre" where it doesn't make sense (in a corpus of English) to treat the individual words as having a separate grammatical function. The same applies to some native English expressions like "hoity toity" and "higgledy piggledy" (what is the grammatical function of "hoity" as distinct from "toity"? - the question doesn't make sense!). More linguistically controversial are sequences of words that act as a single preposition or conjunction, like "in spite of".
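To illustrate what treating a sequence of orthographic words as a single "word" could look like before POS-tagging, here is a minimal Python sketch. It is not the procedure used for the BNC; the small MULTIWORD_UNITS set, the function name merge_multiword_units and the underscore-joining convention are all assumptions made for this example, and the entries are just the expressions mentioned above.

```python
# Hypothetical mini-lexicon built from the examples in this message;
# the actual BNC list is far longer and is not reproduced here.
MULTIWORD_UNITS = {
    ("in", "spite", "of"),
    ("hoity", "toity"),
    ("higgledy", "piggledy"),
    ("in", "flagrante", "delicto"),
    ("hors", "d'oeuvre"),
}
MAX_LEN = max(len(unit) for unit in MULTIWORD_UNITS)


def merge_multiword_units(tokens):
    """Greedy longest-match: join known word sequences into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible sequence first, down to two words.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MULTIWORD_UNITS:
                # Assumed convention: join the parts with underscores.
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word unit starts here; keep the word as it is.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_multiword_units("She left in spite of the rain".split()))
```

A tagger run on the merged output would then assign one tag to "in_spite_of" rather than three separate tags, which is the sense of "multi-word" at issue here.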

This is an interesting list that might be useful for someone contemplating POS-tagging of English text, but the notion of "grammatical idiom" is inevitably fuzzy, and the list cannot be considered comprehensive, except in the sense that it covers the items treated as such in the BNC.

Anyway, if you are looking for a more inclusive notion of "multi-word expression" it might be better to tap into a resource like the excellent "View" facility Mark Davies provides for MWUs in the BNC - see - there may be others.


Geoff Leech

Message: 1
Date: Sat, 02 Jul 2005 19:59:13 +0200
From: Andre Halama <>
To: dychen <>
Subject: [Corpora-List] For list of multi-word units



dychen wrote:

> I am looking for a list or database of English multi-word units
> (including phrases, idioms, compounds, etc.), which is freely available
> for research.

Here is a link to the list of multiword tokens used in BNC2:


