[Corpora-List] Urdu corpus

Miloš Jakubíček milos.jakubicek at sketchengine.eu
Sat Aug 29 18:29:09 CEST 2020


Dear Fatima,

just to add to Eric: in 2018 we also built a ~250M web corpus of Urdu... after cleaning some 600+M originally, most of which turned out to be machine translated :-(

https://www.sketchengine.eu/urtenten-urdu-corpus/

You can get access via a trial account to check the content. Let me know if you'd like to download the corpus for offline use for research purposes.

Best regards,
Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk

On Fri, 28 Aug 2020 at 19:25, Eric Atwell <E.S.Atwell at leeds.ac.uk> wrote:


> I am not an expert on SketchEngine, but I know you can get a free
> 1-month trial licence,
> and it is free to researchers, teachers and students from academic
> institutions in the EU.
>
> For more info on their pricing, see
> https://www.sketchengine.eu/price-list/
>
> eric atwell, Leeds University (non-EU after BREXIT ...)
>
> ------------------------------
> *From:* corpora-bounces at uib.no <corpora-bounces at uib.no> on behalf of
> Fatima Tul Zuhra <fzuhra at cs.qau.edu.pk>
> *Sent:* 28 August 2020 16:56
> *To:* Dan Zeman <zeman at ufal.mff.cuni.cz>
> *Cc:* corpora at uib.no <corpora at uib.no>
> *Subject:* Re: [Corpora-List] Urdu corpus
>
> Thanks to all the responders.
>
> Eric:
> What I know of SketchEngine is that it is not free. Is that right?
>
> Daniel:
> The UD Urdu dataset has some 5000+ sentences in dependency tree format.
> What I need is a fairly large corpus of plain Urdu text. There is one, but it
> is a bit more expensive. I wonder if there is some freely available Urdu
> plain text corpus?
>
> Best regards.
>
> On Fri, Aug 28, 2020 at 8:42 PM Dan Zeman <zeman at ufal.mff.cuni.cz> wrote:
>
> Plus the Urdu Universal Dependencies treebank:
> https://universaldependencies.org/treebanks/ur_udtb/index.html
>
> Best,
> Dan
>
>
> Dne 28.08.2020 v 17:27 Eric Atwell napsal(a):
>
> Fatima,
>
> you can search the 50-million-word Urdu Web Corpus on the SketchEngine
> website
> https://www.sketchengine.eu/urwac-urdu-corpus/
> You can also use SketchEngine to collect your own specialised Urdu text
> corpus.
>
> You can download Urdu corpora from the web, e.g. google "Urdu corpus download"
> or search "Urdu" in www.kaggle.com datasets
>
> for example:
>
> The Holy Quran
> https://www.kaggle.com/zusmani/the-holy-quran
>
> Urdu Language Speech Emotional Corpus
> https://github.com/siddiquelatif/URDU-Dataset
> or https://www.kaggle.com/bitlord/urdu-language-speech-dataset
>
> Urdu Monolingual Corpus
>
> https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5
>
> Urdu-Nepali-English Parallel Corpus
>
> http://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm
>
> English-Urdu Parallel Corpus
> http://ufal.ms.mff.cuni.cz/umc/005-en-ur/
>
> Urdu-Nepali Parallel Corpus
> https://www.kaggle.com/rtatman/urdunepali-parallel-corpus
>
> Urdu / Hindi News Headlines
> https://www.kaggle.com/adnanzaidi/urdu-news-headlines
>
> Urdu Movie Reviews
>
> https://www.kaggle.com/akkefa/imdb-dataset-of-50k-movie-translated-urdu-reviews
>
> iNLTK Urdu News
> https://www.kaggle.com/disisbig/urdu-news-dataset
>
> Urdu Wikipedia
> https://www.kaggle.com/disisbig/urdu-wikipedia-articles
>
> Language Identification dataset
> https://www.kaggle.com/zarajamshaid/language-identification-datasst
>
> Urdu Sentiment Twitter Dataset
> https://www.kaggle.com/raheelabibi/urdu-sentiment-data
>
> Urdu Speech Dataset (audio files)
> https://www.kaggle.com/hazrat/urdu-speech-dataset
>
>
> Eric Atwell, Professor of Artificial Intelligence for Language
> PhD tutor; online MSc AI programme leader
> School of Computing, Uni of LEEDS, LS2 9JT, UK
> http://www.comp.leeds.ac.uk/eric https://www.edubots.eu
>
>
>
> ------------------------------
> *From:* corpora-bounces at uib.no <corpora-bounces at uib.no> on behalf of
> Fatima Tul Zuhra <fzuhra at cs.qau.edu.pk>
> *Sent:* 28 August 2020 15:37
> *To:* corpora at uib.no <corpora at uib.no>
> *Subject:* [Corpora-List] Urdu corpus
>
> Hi,
>
> I want to know if there exists some Urdu corpus that is freely
> downloadable?
>
> Thanks in anticipation.
>
> Regards.
>
> --
> Fatima Tuz Zuhra
> Ph.D. Scholar,
> Quaid i Azam University Islamabad, Pakistan.
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> https://mailman.uib.no/listinfo/corpora
>
>
>
>
> --
> Fatima Tuz Zuhra
>