[Corpora-List] Large free access and downloadable corpora

Bert Van de Poel bert.vandepoel at student.kuleuven.be
Mon Feb 3 10:07:17 CET 2020


Dear Jayr and others who might be interested,

Since you didn't mention which language you are focussing on, I will first recommend some Dutch corpora, as I'm most familiar with them. While they are not available as relational databases, their XML could be parsed into one without much difficulty. The annotated CGN (Corpus Gesproken Nederlands) or SoNaR might be of interest to you. They, as well as other corpora for Dutch and Afrikaans, are available at https://ivdnt.org/taalmaterialen
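
To give a rough idea of the XML-to-database step, here is a minimal Python sketch using only the standard library. It assumes a hypothetical, simplified XML layout (<s> sentences containing <w> tokens with pos and lemma attributes); the actual CGN and SoNaR formats are richer and will need different element and attribute names. The table name "tokens" and the helpers load_tokens and find_by_lemma are illustrative only.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation format. Real CGN/SoNaR files use
# richer schemas, so element and attribute names must be adapted.
SAMPLE_XML = """
<corpus>
  <s id="s1">
    <w pos="LID" lemma="de">De</w>
    <w pos="N" lemma="kat">katten</w>
    <w pos="WW" lemma="slapen">slapen</w>
  </s>
  <s id="s2">
    <w pos="LID" lemma="de">de</w>
    <w pos="N" lemma="kat">kat</w>
    <w pos="WW" lemma="slapen">slaapt</w>
  </s>
</corpus>
"""


def load_tokens(xml_text, conn):
    """Parse the XML and store one row per token; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "sentence_id TEXT, position INTEGER, word TEXT, pos TEXT, lemma TEXT)"
    )
    root = ET.fromstring(xml_text.strip())
    rows = [
        (s.get("id"), i, w.text, w.get("pos"), w.get("lemma"))
        for s in root.iter("s")
        for i, w in enumerate(s.iter("w"))
    ]
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)
    # An index on lemma keeps token lookups fast on large corpora.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_lemma ON tokens (lemma)")
    conn.commit()
    return len(rows)


def find_by_lemma(conn, lemma):
    """Return (sentence_id, word, pos) for every token with this lemma."""
    cur = conn.execute(
        "SELECT sentence_id, word, pos FROM tokens WHERE lemma = ? "
        "ORDER BY sentence_id, position",
        (lemma,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
n = load_tokens(SAMPLE_XML, conn)
hits = find_by_lemma(conn, "kat")
print(n, hits)
```

For a real corpus you would probably want to stream the files (for example with ET.iterparse) rather than load each one fully into memory, and write to a database file instead of ":memory:".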

If you are, as I'd expect, looking for English corpora, you might want to have a look at the BNC (British National Corpus), a 100-million-word corpus spanning many genres and situations. More information and download links are available at http://www.natcorp.ox.ac.uk/ There are also American corpora such as MASC and OANC which might be interesting (the penn/hepple files). They can be downloaded from https://www.anc.org/ (though the website seems to be offline right now). Of course, Mark has already mentioned his great corpora, which are among the few I know of that are available as relational databases.

More generally, you could also explore https://vlo.clarin.eu/ for relevant corpora. The CLARIN Virtual Language Observatory contains much more than just corpora, but it's quite an easy tool for searching through all the different linguistic resources that are available. Not all of them might be available to you since it's a European project, but you could try contacting the authors in that case.

I hope this can get you and those with similar questions started!

Kind regards, Bert Van de Poel

On 2/02/2020 16:17, Mark Davies wrote:
> You might take a look at:
>
> https://www.corpusdata.org/
>
> The samples for these corpora are free, and there is more than 30 million words of data in the free samples (and more than 25 billion words of data in the datasets that can be purchased).
>
>>> and preferably stored on a relational database
> One of the three formats is relational databases-- the same databases that are used for:
>
> https://www.english-corpora.org/
>
> Best,
>
> Mark Davies
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
>
>
> ________________________________________
> From: corpora-bounces at uib.no <corpora-bounces at uib.no> on behalf of Jayr Alencar Pereira <jap2 at cin.ufpe.br>
> Sent: Saturday, February 1, 2020 5:36 AM
> To: corpora at uib.no
> Subject: [Corpora-List] Large free access and downloadable corpora
>
> Hi everybody,
>
> I am looking for a large corpus annotated with at least POS and lemma and preferably stored on a relational database or any other structure that allows searching by tokens.
>
> It is for my MSc project. I am extracting semantic linguistic information like predicate-argument relations. However, the corpus need not be annotated with this kind of information.
>
> Best regards,
>
> --
> ** Pax et bonum
>
> Jayr Alencar Pereira.
> Master's Degree Student
> Center of Informatics, Federal University of Pernambuco, Recife - Brazil
> Homepage: http://www.jayralencar.com.br
> GitHub: https://github.com/jayralencar
> CV Lattes: http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K8561724U9
>
>


