[Corpora-List] Arabic Corpora

Uwe Quasthoff quasthoff at informatik.uni-leipzig.de
Thu Feb 1 08:17:27 CET 2018

Hi Alia,

May I draw your attention to the Leipzig Corpora Collection (http://corpora.uni-leipzig.de/)? There are corpora in more than 200 languages, including Arabic with sources from several countries. The corpora are sentence-separated and sentence-scrambled, and each sentence comes with a source URL. Additional data such as word frequencies and word co-occurrences are included. The corpora can be downloaded directly as plain text or as MySQL database files at http://wortschatz.uni-leipzig.de/en/download. All data were collected by random Web crawling, including newspapers.

From each of the different corpora, up to one million sentences are available for free download. If you need more data, we can provide it free of charge for academic use.
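
To give an idea of how the plain-text download might be used, here is a minimal Python sketch. It assumes that the sentences file holds one sentence per line in the form "id<TAB>sentence"; that layout, the function names, and the tiny in-memory sample are assumptions made for illustration, so please check them against the files you actually download.

```python
import io
from collections import Counter


def read_sentences(handle):
    """Yield (sentence_id, sentence) pairs from a tab-separated text stream.

    Assumed layout (not confirmed here): one "id<TAB>sentence" per line.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        sentence_id, _, sentence = line.partition("\t")
        yield sentence_id, sentence


def token_counts(sentences):
    """Count whitespace-separated tokens over all sentences."""
    counts = Counter()
    for _, sentence in sentences:
        counts.update(sentence.split())
    return counts


# Small in-memory sample standing in for a downloaded sentences file.
sample = io.StringIO("1\tهذا مثال\n2\tهذا مثال آخر\n")
counts = token_counts(read_sentences(sample))
print(counts.most_common(2))
```

For a real file, the same functions could be called on open("sentences.txt", encoding="utf-8"), where the file name is hypothetical. Whitespace splitting is only a crude tokenizer for Arabic; a proper tokenizer or the word frequency data shipped with the corpus would likely serve better.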

Please have a look at our data and feel free to contact me.


Uwe Quasthoff

NLP Group
Dept. of Computer Science
Leipzig University
Germany

On 31.01.2018 at 20:02, Alia Bahanshal wrote:
> Hello,
> Are there any open-source Arabic corpora I can use for deep learning
> research purposes?
> Thanks
