[Corpora-List] Urdu corpus

Surangika Ranathunga surangika at cse.mrt.ac.lk
Fri Aug 28 18:06:36 CEST 2020


Hi Fatima,

Please check this out:
https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs179904
It reports an Urdu corpus of 1.25 billion words.

Regards,
Surangika

On Fri, 28 Aug 2020, 21:27 Fatima Tul Zuhra, <fzuhra at cs.qau.edu.pk> wrote:


> Thanks to all the responders.
>
> Eric:
> What I know of SketchEngine is that it is not free. Is that right?
>
> Daniel:
> The UD Urdu dataset has some 5000+ sentences in dependency tree format.
> What I need is a fairly large corpus of plain Urdu text. There is one, but it
> is a bit more expensive. I wonder if there is some freely available Urdu
> plain text corpus?
>
> Best regards.
>
> On Fri, Aug 28, 2020 at 8:42 PM Dan Zeman <zeman at ufal.mff.cuni.cz> wrote:
>
>> Plus the Urdu Universal Dependencies treebank:
>> https://universaldependencies.org/treebanks/ur_udtb/index.html
>>
>> Best,
>> Dan
>>
>>
>> On 28.08.2020 at 17:27, Eric Atwell wrote:
>>
>> Fatima,
>>
>> you can search the 50-million-word Urdu Web Corpus on the SketchEngine
>> website
>> https://www.sketchengine.eu/urwac-urdu-corpus/
>> You can also use SketchEngine to collect your own specialised Urdu text
>> corpus.
>>
>> You can download Urdu corpora from the web, e.g. search Google for
>> "Urdu corpus download" or search for "Urdu" in the www.kaggle.com datasets,
>>
>> for example:
>>
>> The Holy Quran
>> https://www.kaggle.com/zusmani/the-holy-quran
>>
>> Urdu Language Speech Emotional Corpus
>> https://github.com/siddiquelatif/URDU-Dataset
>> or https://www.kaggle.com/bitlord/urdu-language-speech-dataset
>>
>> Urdu Monolingual Corpus
>>
>> https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5
>>
>> Urdu-Nepali-English Parallel Corpus
>>
>> http://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm
>>
>> English-Urdu Parallel Corpus
>> http://ufal.ms.mff.cuni.cz/umc/005-en-ur/
>>
>> Urdu-Nepali Parallel Corpus
>> https://www.kaggle.com/rtatman/urdunepali-parallel-corpus
>>
>> Urdu / Hindi News Headlines
>> https://www.kaggle.com/adnanzaidi/urdu-news-headlines
>>
>> Urdu Movie Reviews
>>
>> https://www.kaggle.com/akkefa/imdb-dataset-of-50k-movie-translated-urdu-reviews
>>
>> iNLTK Urdu News
>> https://www.kaggle.com/disisbig/urdu-news-dataset
>>
>> Urdu Wikipedia
>> https://www.kaggle.com/disisbig/urdu-wikipedia-articles
>>
>> Language Identification dataset
>> https://www.kaggle.com/zarajamshaid/language-identification-datasst
>>
>> urdu sentiment twitter dataset
>> https://www.kaggle.com/raheelabibi/urdu-sentiment-data
>>
>> Urdu Speech Dataset (audio files)
>> https://www.kaggle.com/hazrat/urdu-speech-dataset
>>
>>
>> Eric Atwell, Professor of Artificial Intelligence for Language
>> PhD tutor; online MSc AI programme leader
>> School of Computing, Uni of LEEDS, LS2 9JT, UK
>> http://www.comp.leeds.ac.uk/eric https://www.edubots.eu
>>
>>
>>
>> ------------------------------
>> *From:* corpora-bounces at uib.no on behalf of Fatima Tul Zuhra
>> <fzuhra at cs.qau.edu.pk>
>> *Sent:* 28 August 2020 15:37
>> *To:* corpora at uib.no
>> *Subject:* [Corpora-List] Urdu corpus
>>
>> Hi,
>>
>> I would like to know whether there is an Urdu corpus that is freely
>> downloadable.
>>
>> Thanks in anticipation.
>>
>> Regards.
>>
>> --
>> Fatima Tuz Zuhra
>> Ph.D. Scholar,
>> Quaid i Azam University Islamabad, Pakistan.
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> https://mailman.uib.no/listinfo/corpora
>>
>>
>>
>
>
> --
> Fatima Tuz Zuhra
>