[Corpora-List] Towards an open-source French tagger

Agata Savary agata.savary at univ-tours.fr
Wed Dec 5 15:47:02 CET 2012

CONCRAFT (http://hackage.haskell.org/package/concraft) is an open-source tagger for Polish based on a novel Constrained Conditional Random Fields model (see [1] for details). It harnesses the complexity of CRFs by constraining the set of labels considered for a given token to those output by a morphological analyzer. It outperforms existing taggers for Polish, notably with respect to unknown words.
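To illustrate the idea of constraining labels per token, here is a minimal Python sketch of Viterbi decoding restricted to analyzer-proposed tags. It is not CONCRAFT's implementation (which is written in Haskell and described in [1]); the function name, the tag names, and the toy scores are all assumptions made for illustration, and the sketch covers decoding only, not training.

```python
from typing import Callable, List, Sequence


def constrained_viterbi(
    tokens: Sequence[str],
    allowed: Sequence[Sequence[str]],
    emit: Callable[[str, str], float],
    trans: Callable[[str, str], float],
) -> List[str]:
    """Return the best-scoring tag sequence, searching only the tags in
    allowed[i] at position i (assumed non-empty for every token)."""
    if not tokens:
        return []
    # best[i][t]: score of the best path that ends with tag t at position i
    best = [{t: emit(tokens[0], t) for t in allowed[0]}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in allowed[i]:
            # Only predecessors allowed at position i-1 are considered.
            prev, score = max(
                ((p, best[i - 1][p] + trans(p, t)) for p in allowed[i - 1]),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + emit(tokens[i], t)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical toy scores; a trained model would supply real weights.
def toy_emit(token: str, tag: str) -> float:
    return 0.0


def toy_trans(prev: str, tag: str) -> float:
    return {("DET", "NOUN"): 2.0, ("PRON", "VERB"): 1.5}.get((prev, tag), 0.0)


# Candidate tags as a hypothetical analyzer might propose them for "la porte".
print(constrained_viterbi(
    ["la", "porte"], [["DET", "PRON"], ["NOUN", "VERB"]], toy_emit, toy_trans
))  # → ['DET', 'NOUN']
```

The search cost per token depends on the number of candidate tags rather than on the size of the full tagset, which is what makes large morphosyntactic tagsets manageable in this setting.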

We are planning to explore CONCRAFT's adaptability to an inflected language of a different family. Thus, we are looking for:
- a morphologically annotated corpus of French (preferably with both parts-of-speech and morphological features such as gender, number, tense, etc.),
- a large-coverage morphological analyser whose tagset would be equivalent to the corpus tagset,
- other freely available taggers for French in view of a contrastive analysis.

A French version of CONCRAFT obtained in this experiment would be distributed under an open license (probably BSD).

[1] Jakub Waszczuk, "Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language", in Proceedings of COLING 2012, Mumbai, India.

