[Corpora-List] Urdu corpus
Dan Zeman
zeman at ufal.mff.cuni.cz
Sat Aug 29 10:01:53 CEST 2020
On 28.08.2020 at 17:56, Fatima Tul Zuhra wrote:
> ...
> The UD Urdu dataset has some 5000+ sentences in dependency tree
> format. What I need is a bit huge corpus of plain Urdu text. There is
> one, but it is a bit more expensive. I wonder if there is some freely
> available Urdu plain text corpus?
You could also try http://hdl.handle.net/11234/1-1989 . It is a
multilingual web-crawled corpus, with the Urdu part containing 46
million words. (It also contains automatic annotation but the plain
sentences can be obtained easily from the data. Nevertheless, I think
the sentences are shuffled so you will not see whole documents.)
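
As a rough sketch of what "obtained easily" could look like: the snippet
below assumes the annotated files are in CoNLL-U format (that is an
assumption here, please check the documentation that comes with the
corpus). It prefers a "# text = ..." comment when a sentence has one and
otherwise rebuilds the sentence from the FORM column. The function name
conllu_sentences and the command-line usage are made up for illustration.

```python
import sys


def conllu_sentences(lines):
    """Yield one plain-text sentence per sentence block.

    Assumes CoNLL-U input: 10 tab-separated columns per token line,
    comment lines starting with '#', and a blank line after each sentence.
    """
    text, pieces = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line closes the current sentence.
            if text is not None or pieces:
                yield text if text is not None else "".join(pieces).strip()
            text, pieces = None, []
        elif line.startswith("# text = "):
            # Original sentence text, if the annotation tool stored it.
            text = line[len("# text = "):]
        elif line.startswith("#"):
            continue  # other comments (sent_id, newdoc, ...)
        else:
            cols = line.split("\t")
            # Keep only plain word lines; multiword ranges ("1-2") and
            # empty nodes ("1.1") are skipped, so the fallback joins
            # syntactic words rather than surface tokens.
            if len(cols) != 10 or not cols[0].isdigit():
                continue
            pieces.append(cols[1])  # FORM column
            no_space = "SpaceAfter=No" in cols[9].split("|")  # MISC column
            pieces.append("" if no_space else " ")
    # Flush the last sentence if the input does not end with a blank line.
    if text is not None or pieces:
        yield text if text is not None else "".join(pieces).strip()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python extract.py some_file.conllu > plain.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        for sentence in conllu_sentences(handle):
            print(sentence)
```

If the files are compressed, they would need to be decompressed first (or
opened with the matching Python module instead of the plain open call).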
Dan