We are happy to announce that we have recently completed work on frWaC, a new corpus resource for French.
Like deWaC (for German), itWaC (for Italian) and ukWaC (for English), frWaC is a mega-corpus (~1.6 billion words) obtained by crawling and post-processing Web data. It is available both in a plain text version and in an annotated version, which includes part-of-speech and lemma information. An earlier version of the corpus and the procedure for its construction are described here:
Ferraresi, A., S. Bernardini, G. Picci and M. Baroni (2010) “Web Corpora for Bilingual Lexicography: A Pilot Study of English/French Collocation Extraction and Translation”. In Xiao, R. (ed.) Using Corpora in Contrastive and Translation Studies. Newcastle: Cambridge Scholars Publishing.
For more details on the corpus and how to obtain it, please visit the WaCky project website:
The WaCkies