[Corpora-List] Corpora for language identification training?

Eric Atwell eric at comp.leeds.ac.uk
Thu Apr 19 11:26:00 CEST 2007


Hi Dean,

Serge Sharoff at Leeds has collected comparable 100-million-word
web-as-corpus corpora for several languages; see
http://corpus.leeds.ac.uk/internet.html

- you can't download the text corpora directly: to avoid copyright
infringement, each web page can only be cached locally. But you CAN
download the list of URLs and then run a program to re-create the
corpora yourself.
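
To give a rough idea of what "re-create the corpora yourself" involves,
here is a minimal Python sketch. It is not the Leeds tool, just an
illustration under assumptions: the URL list is taken to be a plain text
file with one URL per line, and the names read_url_list, fetch_url and
rebuild_corpus are made up for this example. It only caches the raw
pages; stripping markup and boilerplate would be a separate step.

```python
import hashlib
import urllib.request
from pathlib import Path


def read_url_list(path):
    """Return the URLs in a list file (assumed format: one URL per line)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    stripped = (ln.strip() for ln in lines)
    # Skip blank lines and '#' comment lines.
    return [ln for ln in stripped if ln and not ln.startswith("#")]


def fetch_url(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def rebuild_corpus(urls, out_dir, fetch=fetch_url):
    """Cache each page locally; return the lists (saved, failed) of URLs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        # Hash the URL to get a stable, filesystem-safe file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        try:
            (out / name).write_bytes(fetch(url))
            saved.append(url)
        except (OSError, ValueError):
            # Pages disappear over time; note the failure and carry on.
            failed.append(url)
    return saved, failed
```

Something like rebuild_corpus(read_url_list("urls.txt"), "corpus-en")
would then fill a local directory, where "urls.txt" and "corpus-en" are
placeholder names. Expect some URLs to fail, so a re-created corpus will
probably not match the original exactly.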

Not sure if this has been directly used in comparative evaluation of
language identification systems. Try asking Google research labs
http://labs.google.com/faq.html#contact

good luck

eric


Eric Atwell,
Senior Lecturer, Language research group, School of Computing
Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
TEL: 0113-3435430 FAX: 0113-3435468 WWW/email: google Eric Atwell


On Thu, 19 Apr 2007, Dean Jones wrote:


> Hello all,
>
> I'd like to train a classifier to perform language identification,
> and, before I go ahead and create a corpus for this purpose, I'd like
> to ask whether anyone on this list knows of anything suitable. The
> main reason I'm asking is that I'm particularly interested in finding
> something which has been used in the comparative evaluation of
> language identification systems. Languages that we'd initially like to
> cover are English, French, Italian, German and Spanish. Thanks for any
> help,
>
> Best wishes,
>
> Dean.
>







