[Corpora-List] New multi-parallel corpus available (Indic Languages)

Miles Osborne miles at inf.ed.ac.uk
Tue Jan 24 16:29:49 CET 2012


The Indic multi-parallel corpus consists of approximately 2000 Wikipedia sentences translated into the following Indic languages:

Bengali, Hindi, Malayalam, Tamil, Telugu, Urdu

The data was translated by non-expert translators hired over Mechanical Turk, so it is of mixed quality. Every source segment was translated redundantly by four different Turkers. Note that we have translated paragraphs, so the data should be of interest to researchers looking at discourse as well as machine translation.

http://homepages.inf.ed.ac.uk/miles/babel.html

Miles Osborne (Edinburgh)
Chris Callison-Burch (JHU)



