[Corpora-List] Available: German Named Entity Recognition resources

Sebastian Padó pado at ims.uni-stuttgart.de
Thu Jun 24 11:29:16 CEST 2010


Dear all,

We are glad to announce two new resources for German Named Entity Recognition that are freely available for research purposes.

The first resource is a German classifier for the CRF-based Stanford NER system that has been trained on the German CoNLL 2003 dataset. It distinguishes four classes of NEs: person, location, organization, other. It includes features based on lexical clusters obtained from a large (175M tokens) corpus of unlabelled German text, which improve recall by up to 10%.

The second resource consists of two EUROPARL transcripts annotated with Named Entities using the same scheme. The total size is about 110,000 tokens.

According to our evaluation, the classifier is currently among the best NER systems for German.

Condition      | Test set            | Prec | Rec  | F-1
---------------------------------------------------------
In-domain      | (CoNLL 2003 testb)  | 86.6 | 71.2 | 78.2
Out-of-domain  | (EUROPARL)          | 78.0 | 56.7 | 65.6
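
As a reading aid for the F-1 column: assuming it is the usual balanced F-measure (the harmonic mean of precision and recall), it can be recomputed from the other two columns. The Python sketch below is illustrative only; the function name `f1_score` is our own and is not part of the released resources. Because the precision and recall figures in the table are already rounded, the recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Rounded percentages copied from the table above:
# (precision, recall, reported F-1)
reported = {
    "In-domain (CoNLL 2003 testb)": (86.6, 71.2, 78.2),
    "Out-of-domain (EUROPARL)": (78.0, 56.7, 65.6),
}

for condition, (prec, rec, f1_reported) in reported.items():
    f1 = f1_score(prec, rec)
    print(f"{condition}: recomputed F-1 = {f1:.2f}, reported = {f1_reported}")
```

In both conditions the gap between precision and recall is large, so the harmonic mean sits closer to the lower of the two (recall) than a simple average would.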

For more information and downloads, please visit http://nlpado.de/~sebastian/ner_german.html

Sincerely,

Manaal Faruqui & Sebastian Pado
IMS, University of Stuttgart


