[Corpora-List] Corpora for language identification training?

Lluis Padro padro at lsi.upc.edu
Thu Apr 19 12:05:00 CEST 2007


Dean Jones wrote:

> I'd like to train a classifier to perform language identification,
> and, before I go ahead and create a corpus for this purpose, I'd like
> to ask whether anyone on this list knows of anything suitable. The
> main reason I'm asking is that I'm particularly interested in finding
> something which has been used in the comparative evaluation of
> language identification systems. Languages that we'd initially like to
> cover are English, French, Italian, German and Spanish. Thanks for any
> help,

You can try our MM-based identifier. It is GPL-licensed, easy to train
for new languages, and already includes models for most of the
languages you mention.

Visit http://www.lsi.upc.edu/~nlp and look under the "resources" menu.

Best
--
------------------------------------------------------------------------
*Lluís Padró*
Despatx ?-S112
Campus Nord UPC
C/ Jordi Girona 1-3
08034 Barcelona, Spain
Tel: +34 934 134 015
Fax: +34 934 137 833
padro at lsi.upc.edu <mailto:padro at lsi.upc.es>
www.lsi.upc.edu/~padro <http://www.lsi.upc.es/%7Epadro>
------------------------------------------------------------------------
UNIVERSITAT POLITÈCNICA DE CATALUNYA
Dept. Llenguatges i Sistemes Informàtics <http://www.lsi.upc.es>
TALP <http://www.talp.upc.es> Research Center
------------------------------------------------------------------------
