You might take a look at:


The samples for these corpora are free and contain more than 30 million words of data (the datasets that can be purchased contain more than 25 billion words).

>> and preferably stored on a relational database

One of the three formats is relational databases, the same databases that are used for:



Hi everybody,

I am looking for a large corpus annotated with at least POS and lemma and preferably stored on a relational database or any other structure that allows searching by tokens.

It is for my MSc project. I am extracting semantic linguistic information like predicate-argument relations. However, the corpus need not be annotated with this kind of information.
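
As a rough illustration of what searching by tokens in a relational database could look like, here is a minimal sketch in Python with SQLite. The table layout, column names, tag labels, and sample rows are all assumptions made up for this example; they are not the format of any particular corpus.

```python
import sqlite3

# Hypothetical schema: one row per token, holding the word form,
# its lemma, and its POS tag, addressed by text and position.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tokens (
        text_id  INTEGER,
        position INTEGER,
        word     TEXT,
        lemma    TEXT,
        pos      TEXT,
        PRIMARY KEY (text_id, position)
    )
""")
# An index on (lemma, pos) keeps lemma lookups fast on a large table.
conn.execute("CREATE INDEX idx_lemma_pos ON tokens (lemma, pos)")

# Invented sample sentence: "The dogs chased cats"
rows = [
    (1, 0, "The", "the", "DET"),
    (1, 1, "dogs", "dog", "NOUN"),
    (1, 2, "chased", "chase", "VERB"),
    (1, 3, "cats", "cat", "NOUN"),
]
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)


def find_verb_followed_by_noun(conn, verb_lemma):
    """Return (text_id, position, verb form, noun form) for every
    occurrence of the given verb lemma immediately followed by a noun."""
    query = """
        SELECT v.text_id, v.position, v.word, n.word
        FROM tokens AS v
        JOIN tokens AS n
          ON n.text_id = v.text_id AND n.position = v.position + 1
        WHERE v.lemma = ? AND v.pos = 'VERB' AND n.pos = 'NOUN'
        ORDER BY v.text_id, v.position
    """
    return conn.execute(query, (verb_lemma,)).fetchall()


hits = find_verb_followed_by_noun(conn, "chase")
print(hits)
```

The self-join on adjacent positions is one simple way to express a two-token pattern; a crude first pass at predicate-argument candidates could start from queries of this shape.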

