[Corpora-List] Lemmatizing German text for lexical purposes

Eckhard Bick eckhard.bick at mail.dk
Tue Jan 17 11:31:38 CET 2012


Hello Ciarán,

There are a number of lemmatized and morphosyntactically annotated German corpora in the CorpusEye collection, with an online search interface. The link for German is: http://corp.hum.sdu.dk/cqp.de.html, with about 85 million words. The corresponding live analysis is at http://beta.visl.sdu.dk/visl/de/parsing/automatic/parse.php.

Best regards, Eckhard

On 2012-01-16 22:07, Ciarán Ó Duibhín wrote:
> Are there any lemmatized corpora of German which can be queried
> on-line or on Windows? I'm trying to lemmatize some German text
> myself for lexical purposes, and I would like to see how others have
> handled the problems, and how well it works.
> Of the German corpora I have found, Negra is POS-tagged but not
> lemmatized, while Tiger is both POS-tagged and lemmatized. Negra does
> not mention any query facility; Tiger had one which is no longer
> supported and unfortunately doesn't work for me. A problem for me
> with both these corpora is that the tagset they use (STTS) seems to
> be designed with syntax in mind. Here are some examples where this
> may not suit my lexical purposes.
> 1. The various finite forms of a verb (eg. aufsteigen) are lemmatized
> to the infinitive and tagged VVFIN, whereas the abstract noun (das
> Aufsteigen) is tagged NN. I think I would like to be able to retrieve
> them all together, eg. in response to "aufsteigen".
> 2. Present participles and past participles are tagged as adjectives
> (ADJA or ADJD). I think I would like to retrieve these too from the
> verbal infinitive.
> 3. Substantivised adjectives are tagged as nouns (eg etwas
> Ähnliches). I think I would like these retrieved along with the forms
> of the adjective (ähnlich).
> 4. Separable verbs are tagged as two words when separated and as one
> word when not separated. I think I would like to retrieve separated
> and nonseparated examples together, though I have not decided whether
> this is best done by tagging them all as one word or as two.
> 5. Compound forms are not decompounded. I think I would like to
> decompound (most of) them.
> Although my interest is in lemmas, it is sometimes useful for me to
> have POS-tags also, eg. to distinguish arm-ADJ from Arm-NN.
> I have run my text through TreeTagger, using the training data for
> STTS, and expect to have to make the above changes manually. Before
> committing myself further, I'd like to try out anything which already
> exists, or to receive any advice.
> Many thanks,
> Ciarán Ó Duibhín.
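
The regrouping described in points 1 to 4 of the quoted message could be sketched roughly as follows. This is only an illustration, not anything used by Tiger, Negra, TreeTagger or CorpusEye. It assumes input as (word form, STTS tag, lemma) triples. The function name `lexical_keys`, the two small lookup tables and the sample triples are all hypothetical and hand-written rather than real tagger output. It also naively assumes one clause per sentence when re-attaching a separated verb prefix. Decompounding (point 5) is left out.

```python
from typing import Iterable, List, Tuple

# (word form, STTS tag, lemma); hand-written here, not real tagger output
Token = Tuple[str, str, str]

# Hypothetical hand-made lookup tables; real use would need far larger ones.
NOUN_TO_BASE = {
    "Aufsteigen": "aufsteigen",    # point 1: nominalized infinitive -> verb
    "Ähnliches": "ähnlich",        # point 3: substantivized adjective -> adjective
}
PARTICIPLE_TO_VERB = {
    "aufsteigend": "aufsteigen",   # point 2: present participle tagged ADJA/ADJD
    "aufgestiegen": "aufsteigen",  # point 2: past participle tagged ADJA/ADJD
}


def lexical_keys(sentence: Iterable[Token]) -> List[Tuple[str, str]]:
    """Return (word form, lexical key) pairs for one sentence."""
    tokens = list(sentence)
    # Point 4: take the first separated verb prefix (PTKVZ) in the sentence, if any.
    # Naive assumption: the sentence contains a single clause.
    prefix = next((lemma for _, tag, lemma in tokens if tag == "PTKVZ"), "")
    result = []
    for word, tag, lemma in tokens:
        if tag == "PTKVZ":
            continue  # the prefix is folded into the verb's key instead
        if tag == "VVFIN" and prefix:
            key = prefix + lemma  # prepended to every VVFIN lemma in the sentence
        elif tag == "NN":
            key = NOUN_TO_BASE.get(lemma, lemma)
        elif tag in ("ADJA", "ADJD"):
            key = PARTICIPLE_TO_VERB.get(lemma, lemma)
        else:
            key = lemma
        result.append((word, key))
    return result


sample = [
    ("Er", "PPER", "er"),
    ("steigt", "VVFIN", "steigen"),
    ("schnell", "ADJD", "schnell"),
    ("auf", "PTKVZ", "auf"),
]
print(lexical_keys(sample))
```

Here the separated and unseparated forms end up under the single key "aufsteigen", which corresponds to the "one word" option in point 4. Keeping the original STTS tag alongside the key would still allow arm (adjective) to be told apart from Arm (noun).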

--
Eckhard Bick, cand.med., dr.phil.
University of Southern Denmark
e-mail: eckhard.bick at mail.dk
web: http://beta.visl.sdu.dk


