[Corpora-List] Is Language Identification Really Solved?

liling tan alvations at gmail.com
Tue Jun 30 12:19:29 CEST 2015

Here's the version with the organic, non-shortened links. Let's get back to the discussion of language ID ;)

Dear (Corpora/Computational) Linguists and NLP/ML practitioners,

*How much of a misconception is it that language identification is a "solved task"?* Five years ago, there was some discussion: http://www.aclweb.org/anthology/N10-1027.pdf

Today, *are we "still a long way off perfect language identification of web documents, as evaluated under realistic conditions"?*

The often-cited 99+% accuracy for language ID is usually based on a fixed set of languages that are not closely related.

*Do we really know which languages are similar?* Finding similar languages in a corpus of 1000+ languages seems difficult (http://www.aclweb.org/anthology/W/W14/W14-2211.pdf).

It also seems that different people keep reaching back-and-forth conclusions as to whether language ID for a very large number of languages is possible: http://link.springer.com/chapter/10.1007/978-3-642-40585-3_60 <http://goo.gl/TR0MUU> and http://research.microsoft.com/pubs/138760/Xia-Lewis-Lewis.pdf <http://goo.gl/RL9SNV>

Most recently, the Discriminating between Similar Languages (DSL) Shared Task also shows that what we know about language ID is still far from perfect: https://github.com/Simdiva/DSL-Task/blob/master/DSL2015-results.md <https://goo.gl/PBtXjd>

*Is Language Identification Really Solved?*

