[Corpora-List] 10M word French corpus and treebank freely available

Sylvain Kahane sylvain at kahane.fr
Sat Nov 24 14:22:59 CET 2018

Dear colleagues,

We are pleased to announce that the Orfeo portal is openly accessible at the following address:


Orfeo gives access to the Corpus for the Study of Contemporary French (CEFC). The corpus consists of 10 million words:

- 4 million words of spoken French, transcribed from about 350 hours of recordings collected in France, Switzerland and Belgium in different diaphasic situations (face-to-face conversations, interviews, debates, classroom interactions, lectures, sermons, and speeches, as well as radio and television programs).

- 6 million words of written texts from a wide range of genres (e.g. literature, scientific texts, regional and national press, essays, academic, non-standard writings).

The portal gives access to the acoustic files and textual resources. The corpus can be searched by the textual and register variables available in the metadata, as well as by lexical and morpho-syntactic (POS) annotations. All queries return orthographic transcriptions aligned with the audio files.

The entire corpus is further semi-automatically annotated with syntactic dependencies. The search tool can return dependency patterns. About 150,000 words have been corrected and constitute the gold treebank.

All files (texts, sounds and annotations) are freely downloadable. Guides are provided for all types of annotations.

The treebank and the platform development have been funded by the French National Research Agency (ANR project Orféo, directed by Jeanne-Marie Debaisieux).
