[Corpora-List] Urdu corpus

Dan Zeman zeman at ufal.mff.cuni.cz
Sat Aug 29 10:01:53 CEST 2020

On 28.08.2020 at 17:56, Fatima Tul Zuhra wrote:
> ...
> The UD Urdu dataset has some 5000+ sentences in dependency tree
> format. What I need is a fairly large corpus of plain Urdu text. There is
> one, but it is rather expensive. I wonder if there is some freely
> available Urdu plain text corpus?

You could also try http://hdl.handle.net/11234/1-1989. It is a multilingual web-crawled corpus; the Urdu part contains 46 million words. (It also contains automatic annotation, but the plain sentences can be obtained easily from the data. However, I think the sentences are shuffled, so you will not see whole documents.)
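
In case it helps, here is a minimal sketch of how the plain sentences could be pulled out of the annotated data. It assumes the files are in CoNLL-U format (tab-separated token lines, a blank line between sentences, optional "# text = ..." comments); that format is an assumption here, and the function name plain_sentences is made up for illustration.

```python
from typing import Iterable, Iterator, List, Optional


def plain_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield one plain-text sentence per sentence block (CoNLL-U assumed)."""
    text: Optional[str] = None   # value of a "# text = ..." comment, if present
    forms: List[str] = []        # surface forms collected from token lines
    skip_until = 0               # last word ID covered by a multiword token

    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if text is not None or forms:
                yield text if text is not None else " ".join(forms)
            text, forms, skip_until = None, [], 0
            continue
        if line.startswith("#"):
            if line.startswith("# text = "):
                text = line[len("# text = "):]
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue  # not a token line
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Multiword token such as "1-2": keep its surface form and
            # skip the syntactic words it covers.
            forms.append(form)
            skip_until = int(tok_id.split("-")[1])
            continue
        if not tok_id.isdigit():
            continue  # empty node such as "1.1", or an unexpected ID
        if int(tok_id) <= skip_until:
            continue  # word already covered by a multiword token
        forms.append(form)

    # Flush the last sentence if the input does not end with a blank line.
    if text is not None or forms:
        yield text if text is not None else " ".join(forms)
```

To use it, open the decompressed Urdu file (the file name is whatever you get from the download) with encoding="utf-8", pass the file object to plain_sentences, and write each yielded sentence on its own line. When no "# text = ..." comment is present, the tokens are simply joined with spaces, so the original spacing around punctuation is not restored.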

