[Corpora-List] Getting articles from newspapers to compile a corpus

Gisle Andersen Gisle.Andersen at nhh.no
Fri Nov 30 12:29:02 CET 2012

Dear Matías,

For Norwegian, a 1-billion-word Newspaper Corpus has been compiled using web crawler technology (wget and w3mir), followed by boilerplate and duplicate removal, text annotation, etc. It contains texts from 24 national, regional and local newspapers, covering the period from 1998 to the present. For details, see this reference:

Andersen, Gisle and Hofland, Knut (2012), 'Building a large monitor corpus based on newspapers on the web', in Gisle Andersen (ed.), Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian (Amsterdam: John Benjamins), 1-30.
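
As a rough illustration of the duplicate-removal step mentioned above, a minimal Python sketch could look like the following. It is not the procedure described in the reference: the function names (normalise, remove_exact_duplicates) and the whitespace/case normalisation are assumptions made for this example, and it only catches exact duplicates after normalisation, not near-duplicates.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants hash alike.

    This normalisation is an assumption for the example, not a documented
    step of any particular corpus pipeline.
    """
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_exact_duplicates(articles):
    """Keep the first occurrence of each article text, in original order."""
    seen = set()
    unique = []
    for text in articles:
        # Hash the normalised text so only a short digest is kept in memory.
        digest = hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Calling remove_exact_duplicates on a list of downloaded article texts returns the list with later copies dropped; articles that differ by more than case or whitespace would need a near-duplicate method instead.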

Kind regards, Gisle Andersen, NHH
