[Corpora-List] Community-driven corpus building

Martin Reynaert reynaert at uvt.nl
Thu Apr 14 15:26:24 CEST 2011

Dear list,

In the thread 'Spellchecker evaluation corpus', Stefan Bordag just described a plug-in which strikes me as having far greater potential than the use he envisages, hence this new thread.

Stefan wrote: "perhaps producing such a corpus wouldn't be so difficult after all. Perhaps all it takes is a custom plugin for Open Office which people can use when they review documents they write in OO for errors. In this plugin, simply by clicking some accept button provided by the plugin they'd consent to have both the original version and the revised version sent to some database known to the plugin. With some time perhaps a sizeable collection of all sorts of corrections in all sorts of languages could be produced by this."

What Stefan defines here appears to me to be a killer application for corpus building.

Setting up this kind of system implies that people donate their texts and their texts' editing history. The manner in which this is done would in fact allow for the fully automatic, community-driven building of corpora of contemporary written text. For any language, for any kind of corpus research.

This would solve the two major bottlenecks we encounter daily in building a large reference corpus of contemporary written Dutch: IPR settlement and metadata/text processing.

Who better than the author, at the time of donation, to supply the necessary metadata? For instance:

- personal: allowing the author to determine what level of personal information (s)he wishes to be associated with the particular text
- text: information about encoding, text type, register, style
- language: with possibility of indicating his/her level of proficiency
- processing: whether spelling/grammar checking was applied, using which particular tools...
- etc.
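To make the metadata categories listed above a little more concrete, here is a minimal sketch in Python of what a donation record might look like. All field names, default values and the `public_view` helper are my own assumptions for illustration; they do not come from any existing plug-in or metadata scheme.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class DonationMetadata:
    """Hypothetical metadata record filled in by the author when donating."""
    # personal: the author decides how much is disclosed
    disclosure_level: str = "anonymous"  # assumed values: "anonymous", "named"
    author_name: Optional[str] = None
    # text
    encoding: str = "UTF-8"
    text_type: str = "unknown"
    register: Optional[str] = None
    style: Optional[str] = None
    # language
    language: str = "und"                # "und" = undetermined
    proficiency: Optional[str] = None
    # processing
    spell_checked: bool = False
    grammar_checked: bool = False
    tools: List[str] = field(default_factory=list)


def public_view(meta: DonationMetadata) -> dict:
    """Return the record as a dict, hiding the name for anonymous donations."""
    record = asdict(meta)
    if meta.disclosure_level == "anonymous":
        record["author_name"] = None
    return record
```

The point of the sketch is only that the author's chosen disclosure level governs what ends up in the corpus; a real scheme would obviously need many more fields.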

Casually as they are mentioned here, some of the types of information listed above are not, and cannot be, collected for our corpus today.

All this metadata could then automatically be incorporated in a suitable metadata scheme (e.g. CMDI), and the text itself, properly segmented into sections, paragraphs, etc., with proper identification of headers/footers, tables, pictures, etc., saved in a suitable XML format and sent on. Compare this to what one currently obtains when automatically converting from PDF...
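As a rough illustration of the packaging step, the following Python sketch wraps metadata and a segmented text in a single XML document. The element names (`donation`, `metadata`, `field`, `section`, `head`, `p`) are invented for this example; this is not the CMDI schema, nor any format we actually use.

```python
import xml.etree.ElementTree as ET


def build_document(metadata, sections):
    """Serialize metadata (a dict) and sections (a list of
    (title, [paragraph, ...]) pairs) into one XML string.
    The layout is a made-up placeholder, not CMDI."""
    root = ET.Element("donation")

    meta_el = ET.SubElement(root, "metadata")
    for key, value in sorted(metadata.items()):
        field_el = ET.SubElement(meta_el, "field", name=key)
        field_el.text = str(value)

    body = ET.SubElement(root, "text")
    for title, paragraphs in sections:
        sec = ET.SubElement(body, "section")
        ET.SubElement(sec, "head").text = title
        for para in paragraphs:
            ET.SubElement(sec, "p").text = para

    return ET.tostring(root, encoding="unicode")
```

Because the plug-in sees the document's own structure, sections and paragraphs arrive already delimited, which is exactly what PDF conversion cannot guarantee.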

The receiving web service would then incorporate the text into the appropriate subcorpus according to, e.g., text type, assign it the proper file name with the appropriate file number, and make it available to other web services for further linguistic enrichment: tokenization, POS tagging, automatic correction/normalization, syntactic parsing, etc. Given the included edit histories, this would of course also entail gathering immensely valuable information on the writing process itself.
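The receiving side could be sketched along the following lines, again in Python. The class name, the file-naming pattern and the `enrich` helper are hypothetical; the enrichment steps are passed in as plain functions and merely stand in for real tokenizers, taggers and parsers.

```python
from collections import defaultdict


class CorpusReceiver:
    """Toy in-memory stand-in for the receiving web service."""

    def __init__(self):
        self._counters = defaultdict(int)    # next file number per subcorpus
        self.subcorpora = defaultdict(dict)  # subcorpus -> {file name: text}

    def ingest(self, text, text_type):
        """File the text under its text type and return the assigned name."""
        subcorpus = text_type or "unclassified"
        self._counters[subcorpus] += 1
        # Assumed naming pattern: <subcorpus>-<six-digit number>.xml
        file_name = f"{subcorpus}-{self._counters[subcorpus]:06d}.xml"
        self.subcorpora[subcorpus][file_name] = text
        return file_name


def enrich(text, steps):
    """Apply enrichment steps in order, each consuming the previous output."""
    result = text
    for step in steps:
        result = step(result)
    return result
```

For example, `CorpusReceiver().ingest("...", "email")` would file the first e-mail donation as `email-000001.xml`, and `enrich(text, [str.split])` would run a crude whitespace tokenizer in place of a real one.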

I have a dream... To which I might add an adjective denoting a high level of humidity. In which case, donating this very text using the service outlined above, I would naturally attach a low level of divulgence of personal information within the corpus ;0)

Martin Reynaert
Coordinator Work Package Corpus Building SoNaR
ILK
UvT
The Netherlands

