[Corpora-List] Corpora for language identification training?

Eric Atwell eric at comp.leeds.ac.uk
Thu Apr 19 11:26:00 CEST 2007

Hi Dean,

Serge Sharoff at Leeds has collected comparable 100-million-word
web-as-corpus corpora for several languages, see

- you can't download the text corpora directly, since to avoid copyright
infringement each web page may only be cached locally; but you CAN download
the list of URLs and then run a program to re-create the corpora yourself.
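
For illustration, the re-creation step could look roughly like the sketch
below. It is not the actual Leeds tooling: it assumes a plain text file with
one URL per line (the real list format may differ), and the names
strip_html, fetch and rebuild_corpus are hypothetical. The HTML stripping is
deliberately crude and does no boilerplate removal or de-duplication.

```python
import html
import re
import urllib.request
from pathlib import Path


def strip_html(page: str) -> str:
    """Crudely reduce an HTML page to plain text (no boilerplate removal)."""
    # Drop script/style elements together with their contents.
    page = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", page)
    # Drop all remaining tags, then decode entities and collapse whitespace.
    page = re.sub(r"(?s)<[^>]+>", " ", page)
    return re.sub(r"\s+", " ", html.unescape(page)).strip()


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it, falling back to UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def rebuild_corpus(url_list_path, out_dir, fetcher=fetch) -> int:
    """Fetch every URL in the list and save its text; return files written.

    Assumed input format: one URL per line; blank lines and lines
    starting with '#' are ignored.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for line in Path(url_list_path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        try:
            text = strip_html(fetcher(url))
        except Exception as exc:  # dead links are expected in old URL lists
            print(f"skipped {url}: {exc}")
            continue
        if text:
            (out / f"{saved:06d}.txt").write_text(text, encoding="utf-8")
            saved += 1
    return saved
```

Calling rebuild_corpus("urls.txt", "corpus_fr") would then write one text
file per page that could still be retrieved. Since pages disappear over time,
a corpus rebuilt this way will usually be smaller than the original.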

Not sure if this has been directly used in comparative evaluation of
language identification systems. Try asking Google research labs.

Good luck,


Eric Atwell,
Senior Lecturer, Language research group, School of Computing
Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
TEL: 0113-3435430 FAX: 0113-3435468 WWW/email: google Eric Atwell

On Thu, 19 Apr 2007, Dean Jones wrote:

> Hello all,
>
> I'd like to train a classifier to perform language identification,
> and, before I go ahead and create a corpus for this purpose, I'd like
> to ask whether anyone on this list knows of anything suitable. The
> main reason I'm asking is that I'm particularly interested in finding
> something which has been used in the comparative evaluation of
> language identification systems. Languages that we'd initially like to
> cover are English, French, Italian, German and Spanish. Thanks for any
> help,
>
> Best wishes,
>
> Dean.

