[Corpora-List] Corpora for language identification training?

Mike Maxwell maxwell at umiacs.umd.edu
Thu Apr 19 14:08:08 CEST 2007


Dean Jones wrote:
> I'd like to train a classifier to perform language identification,
> and, before I go ahead and create a corpus for this purpose, I'd like
> to ask whether anyone on this list knows of anything suitable.

I presume you're asking about spoken language ID, not ID of the language
of computer-readable texts or of images of printed or handwritten text.

There have been a number of evaluations of spoken language ID by NIST.
You might have a look at this:
http://www.nist.gov/speech/tests/lang/2003/index.htm
I believe the data for all the evals was provided by the LDC, although a
quick glance at the LDC catalog (http://www.ldc.upenn.edu/Catalog/)
didn't show it.
--
Mike Maxwell
maxwell at umiacs.umd.edu




