[Corpora-List] New version of the French TreeBank

Yonatan Ginzburg yonatan.ginzburg at univ-paris-diderot.fr
Thu Apr 20 11:39:14 CEST 2017


The Laboratoire de Linguistique Formelle (http://www.llf.cnrs.fr) is happy to announce a new version of the French Treebank and a new website:

http://ftb.linguist.univ-paris-diderot.fr/

The French Treebank (Abeillé et al. 2003), built from extracts of the newspaper Le Monde (1990-93), is a unique large-scale resource for French, with rich syntactic annotations (compounds, lemmas, inflection, constituents, grammatical functions…) and human validation, covering 21,550 sentences (664,500 tokens).

It has been developed by A. Abeillé et al. since 1997 with the support of the Institut Universitaire de France and University Paris Diderot.

It is freely distributed for research purposes and used by more than 150 laboratories and companies across the world.

This new version has:

- 3,000 more sentences (about 90,000 words),

- additional annotations on the whole corpus: all compounds have been annotated for their parts,

- added metadata (article, author, date, domain),

- additional versions: utf-8 Penn Treebank format, utf-8 Tiger-XML format, CoNLL format (Candito et al. 2009; 2010).

For search queries, the Penn Treebank format may be used with Tregex (https://nlp.stanford.edu/software/tregex.shtml) and the Tiger-XML format with TIGERSearch (http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/tigersearch.html).
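For readers who only need a quick look at the bracketed (Penn Treebank style) files without installing a query tool, a minimal Python sketch along the following lines may be enough. It is not Tregex and not part of the distribution. The sample sentence and its labels (SENT, NP, VN, DET, NC, V) are invented for illustration, and the function names are hypothetical; the actual files may differ in details such as wrapper brackets or escaping.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. An unlabeled wrapper such as "( (SENT ...))"
    gets the empty string as its label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # The label is the token right after "(", unless another "(" follows.
        if tokens[pos] == "(":
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def find(tree, label):
    """Yield every subtree whose label equals `label`."""
    if isinstance(tree, str):
        return
    if tree[0] == label:
        yield tree
    for child in tree[1]:
        yield from find(child, label)


def leaves(tree):
    """Return the words under a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


# Invented example, not taken from the corpus.
sample = "(SENT (NP (DET Le) (NC chat)) (VN (V dort)))"
tree = parse_bracketed(sample)
noun_phrases = [" ".join(leaves(np)) for np in find(tree, "NP")]
print(noun_phrases)
```

For anything beyond label lookups (dominance, precedence, negation), the dedicated query tools mentioned above are the better choice.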

The corpus is free of charge for research projects. To download this new version, you need to fill in the online form (http://ftb.linguist.univ-paris-diderot.fr/telecharger.php?&langue=en) and accept the new terms and conditions of use (http://ftb.linguist.univ-paris-diderot.fr/treebank.php?fichier=cgu), which cover research purposes only.

If you want a commercial licence, please contact ftb at linguist.univ-paris-diderot.fr directly. Before making a request, you can test a 100-sentence sample available on the website.

Technical specifications:

· Version 1.0, April 3rd, 2017

· 21,550 sentences from the daily Le Monde (1990-1993): extracts from 1,143 articles

· 664,500 tokens

· 44 files; formats: XML, Tiger-XML, PTB and CoNLL

· Metadata (211 authors, date, 14 domains)

· Lexical annotations (categories, subcategories, inflections, compounds with components)

· Syntactic annotations (main constituents, grammatical functions)

· Annotations corrected and validated by hand
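As a rough illustration of working with the CoNLL version, the sketch below reads dependency sentences in Python. It assumes the common 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences; whether the distributed files follow exactly this layout should be checked against the corpus documentation. The sample rows, their tags, and the function name `read_conll` are invented for the example.

```python
import io


def read_conll(lines):
    """Yield sentences as lists of token dicts from CoNLL-X style lines.

    Assumes tab-separated columns with ID in column 1, FORM in column 2,
    HEAD in column 7 and DEPREL in column 8; blank lines end a sentence.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if sentence:  # last sentence when the file lacks a trailing blank line
        yield sentence


# Invented rows, not taken from the corpus.
rows = [
    ["1", "Le", "le", "D", "DET", "_", "2", "det", "_", "_"],
    ["2", "chat", "chat", "N", "NC", "_", "3", "suj", "_", "_"],
    ["3", "dort", "dormir", "V", "V", "_", "0", "root", "_", "_"],
]
sample = "\n".join("\t".join(row) for row in rows) + "\n\n"

sentences = list(read_conll(io.StringIO(sample)))
# By the usual convention, the root token has HEAD 0.
roots = [tok["form"] for tok in sentences[0] if tok["head"] == 0]
print(len(sentences), roots)
```

To read an actual file, pass an open file handle (opened with `encoding="utf-8"`) in place of the `io.StringIO` object.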

Website: http://ftb.linguist.univ-paris-diderot.fr/

Version history:

Note that the 1.0 release is the first full release, in which all functional and morpho-syntactic tags are available for all sentences. Earlier, several beta versions were released in which only a subset of the sentences carried grammatical functions, for example:

- 2005: whole set of sentences used by Arun et al. 2005, without any functional annotations;

- 2007: version with 12,531 sentences with functional annotations, used e.g. in Candito et al. 2010;

- 2010: version with 15,922 sentences with functional annotations, used e.g. in Green et al. 2011;

- 2013: version with 18,535 sentences with functional annotations, used for the SPMRL 2013 shared task (Seddah et al. 2013).
