[Corpora-List] Corpus Development

Oliver Mason O.Mason at bham.ac.uk
Sun Apr 20 11:18:09 CEST 2008



> By fully functional, I mean something that can be rightly called a corpus.

That probably opens a can of worms, but one definition of a corpus would be authentic data collected for answering a specific research question. Most corpora are general enough to answer many questions, but 'fully functional' only makes sense in relation to a particular question. If you want to look at spoken Pashto, then a corpus of written data would be useless. And I don't think you can create a corpus to answer all conceivable questions.

For example, the Bank of English was collected for the purpose of creating a contemporary learners' dictionary. Hence it contains no historical data, but it does cover a variety of genres/text types and data from various regions. As it happens, it can be (and is) also used for looking at aspects of English other than lexis.

I'm not sure if there was a specific purpose for creating the BNC (Lou would know, I guess), but it too is suitable for many different research questions. FLOB and Frown were mainly collected for investigating language change, but they are more versatile than that.

As for software, a corpus is just data. If it is stored in a particular format, many programs can be used to process it, which is desirable, as you never know what the next person will want to use it for.
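
To illustrate the point that a corpus is just data, here is a minimal sketch in Python, assuming the corpus is stored as plain text. The toy corpus and the function names (tokenize, frequency_list, concordance) are made up for illustration and do not come from any existing tool; the tokenisation is deliberately crude. It shows two independent 'programs' working on the same data without either knowing about the other.

```python
import re
from collections import Counter

# Hypothetical toy corpus: plain text, one sentence per line.
CORPUS = """The cat sat on the mat.
The dog sat on the log.
A cat and a dog met."""


def tokenize(text):
    """Lowercase the text and split it into crude word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(text):
    """First 'program': count how often each word form occurs."""
    return Counter(tokenize(text))


def concordance(text, node, width=2):
    """Second 'program': keyword-in-context lines for the word `node`.

    Context is taken from the flat token stream, so it may cross
    sentence boundaries; a real tool would probably respect them.
    """
    tokens = tokenize(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


if __name__ == "__main__":
    print(frequency_list(CORPUS).most_common(3))
    for line in concordance(CORPUS, "cat"):
        print(line)
```

Neither function depends on the other; all they share is the assumption about how the data is stored, which is why an open, documented format matters more than any particular piece of software.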

Oliver


