[Corpora-List] French corpora

Djamé Seddah djame.seddah at free.fr
Tue Dec 16 14:31:28 CET 2014

Hi Antoinette,

Here's a version of the Est Republicain corpus with state-of-the-art pre-processing (lemma, POS, MWE annotation).

This data set was used for the SPMRL 2014 Shared Task (see below for notes).


If you're interested in manually validated data sets, you can have a look at these:

The Cr#pBank (1700 sentences, another 2.7k coming up) is here: http://pauillac.inria.fr/~seddah/FrenchSocialMediaBank-v0.9.1beta.tar.gz

It covers Twitter, Facebook, Doctissimo (health forum) and jeuxvideos.com (video games), with both noisy and less noisy text [1].

The Sequoia treebank (3200 sentences, both constituency and dependency versions) is here: https://www.rocq.inria.fr/alpage-wiki/tiki-index.php?page=CorpusSequoia
It covers a local newspaper, Wikipedia (history part), biomedical text and Europarl.

It's described in French in [2] and in English in the first half of [3].

The Deep Sequoia (the same, but with deep syntax information) is here: http://deep-sequoia.inria.fr [4]

You can also query some annotated corpora collected circa 2004 by Susanne Salmon-Alt and colleagues: this is the freebank base (its newswire part is available via a password-protected link, but the raw text comes from the Ananas corpus [5]).


Of course, if you need the French Treebank (free for research, Le Monde text), please contact Anne Abeillé and Clément Plancq (clement.plancq at linguist.univ-paris-diderot.fr), who is in charge of distributing the original XML sources, or Marie Candito (Marie.Candito at gmail.com) for the current phrase-based and dependency ready-to-parse versions.

If you really need a huge annotated data set from newswire text, such as parsed French AFP streams from the last 4 years, please contact Eric de la Clergerie <Eric.De_La_Clergerie at inria.fr>.

It's also my understanding that some of the texts from the Monde Diplomatique are released under a Creative Commons license, but I don't know whether anyone has taken the time to gather them into a corpus. Many more resources exist (football corpus, transcribed broadcast news, literary sources and so on), so I'm sure you'll find what you want; ask again otherwise.

Best, Djamé

[1] The French Social Media Bank: a Treebank of Noisy User Generated Content, Djamé Seddah, Benoit Sagot, Marie Candito, Virginie Mouilleron, Vanessa Combet, COLING 2012, Mumbai, India
[2] Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical, Candito M.-H. and Djamé Seddah, 2012, Proceedings of TALN'2012, Grenoble, France
[3] A Word Clustering Approach to Domain Adaptation: Robust parsing of source and target domains, Djamé Seddah, Marie Candito and Enrique Henestroza Anguiano (to appear in the Grammars, parsers and recognisers special issue of the Journal of Logic and Computation, 12-27)
[4] Deep Syntax Annotation of the Sequoia French Treebank, Marie Candito, Guy Perrier, Bruno Guillaume, Corentin Ribeyre, Karën Fort, Djamé Seddah and Eric de la Clergerie, 2014, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland
[5] Le projet ANANAS : Annotation Anaphorique pour l’Analyse Sémantique de Corpus, Susanne Salmon-Alt, 2002, Proceedings of TALN'2002, Nancy


Notes on the Est Republicain (SPMRL version)

Unlabeled French Data Set: This data set is derived from the release of the Est Republicain Corpus [1], preprocessed following [2], with morphological predictions (lemma, POS, features) generated by Morfette [3] trained on the SPMRL 2013 Shared Task French data set (train full/gold). MWE annotations have been added via Lgtagger [4] trained on the same data.

Statistics:
# of sentences: > 8 million
# of tokens: > 159 million

Annotation scheme: The morphological annotation scheme follows exactly the one used in the French "gold files". Besides the pred=y and mwehead=POS+ features (which mark, respectively, a token that is part of a compound/MWE and the part-of-speech of the whole compound, as taken from the constituent file; see the French data set documentation in FRENCH_SPMRL/doc/readme.spmrl), we also included the predicted dependencies for the internal structure of the compounds in the fields HEAD, DEPREL, PHEAD, PDEPREL. Adding them as features instead of "pre-bracketed" dependencies is trivial and left to the participants if they so wish.
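For readers who want to move the compound-internal dependencies into the feature column, here is a minimal Python sketch. It assumes the files are tab-separated with the ten CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and pipe-separated features. The feature names mwe_head and mwe_deprel, the function name and the sample token are made up for illustration and are not part of the release.

```python
def deps_to_feats(line: str) -> str:
    """Move the HEAD/DEPREL values of one token line into the FEATS column.

    Assumes 10 tab-separated CoNLL-X columns. The feature names
    `mwe_head` and `mwe_deprel` are hypothetical, not part of the data set.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        # Blank sentence separators and anything unexpected pass through.
        return line.rstrip("\n")
    feats, head, deprel = cols[5], cols[6], cols[7]
    if head in ("_", ""):
        # No pre-bracketed dependency on this token: nothing to move.
        return "\t".join(cols)
    extra = f"mwe_head={head}|mwe_deprel={deprel}"
    cols[5] = extra if feats in ("_", "") else f"{feats}|{extra}"
    # Clear the four dependency columns once their content is in FEATS.
    cols[6] = cols[7] = cols[8] = cols[9] = "_"
    return "\t".join(cols)


# Illustrative token line (invented, not taken from the corpus).
sample = "2\tde\tde\tP\tP\tpred=y\t1\tdep_cpd\t1\tdep_cpd"
print(deps_to_feats(sample))
```

Tokens without a predicted head are left untouched, so the function can be applied to every line of a file.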

Quality of the annotations (on the dev set)

lemmas acc: 99.10
cpos acc: 97.98
fpos acc: 97.43
feat acc: 81.31
feat acc (no mwe features): 92.79

MWE recognition performance is at 81.2% F-score on the dev set [5].
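As a rough illustration of how token-level accuracies like the lemma, cpos, fpos and feat figures above can be computed, here is a small Python sketch. It assumes gold and predicted files are line-aligned, tab-separated CoNLL-X style, and that accuracy means exact string match per token; this is an assumption, not necessarily the evaluation script that was actually used. The function name and the toy data are invented.

```python
def column_accuracy(gold_lines, pred_lines, col):
    """Percentage of tokens whose value in column `col` matches exactly.

    Assumes both inputs are line-aligned, tab-separated token lines.
    In CoNLL-X, col 2 = LEMMA, 3 = CPOSTAG, 4 = POSTAG, 5 = FEATS.
    """
    total = correct = 0
    for gold, pred in zip(gold_lines, pred_lines):
        g_cols = gold.rstrip("\n").split("\t")
        p_cols = pred.rstrip("\n").split("\t")
        if len(g_cols) <= col or len(p_cols) <= col:
            continue  # skip blank sentence separators
        total += 1
        correct += g_cols[col] == p_cols[col]
    return 100.0 * correct / total if total else 0.0


# Toy example (invented): one of the two lemmas is wrong.
gold = ["1\tles\tle\tD\tDET\t_", "2\tchats\tchat\tN\tNC\t_"]
pred = ["1\tles\tle\tD\tDET\t_", "2\tchats\tchats\tN\tNC\t_"]
print(column_accuracy(gold, pred, 2))  # 50.0
```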

Full Mate (graph-based) dependency predictions will be made available soon (at least for a significant subset of this data set).

Djamé Seddah, Marie Candito and Matthieu Constant

[1] Bertrand Gaiffe and Kamel Nehbi. 2009. Le corpus de l'Est Républicain. Technical report, Atilf. http://www.cnrtl.fr/corpus/estrepublicain/
[2] Djamé Seddah, Marie Candito, Benoit Crabbé and Enrique Henestroza Anguiano. 2012. Ubiquitous Usage of a Broad Coverage French Corpus: Processing the Est Republicain corpus. In Proceedings of LREC'2012.
[3] Grzegorz Chrupała, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with morfette. In Proc. of LREC 2008, Marrakech, Morocco.
[4] Matthieu Constant, Anthony Sigogne, and Patrick Watrin. 2012. Discriminative strategies to integrate multiword expression recognition and parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, pages 204–212, Stroudsburg, PA, USA. Association for Computational Linguistics.
[5] Constant M., Candito M. and Seddah D. 2013. The LIGM-Alpage architecture for the SPMRL 2013 Shared Task: Multiword Expression Analysis and Dependency Parsing. Proceedings of the Fourth SPMRL Workshop, Seattle, USA.
