[Corpora-List] Lemmatizing German text for lexical purposes

Ciarán Ó Duibhín ciaran at oduibhin.freeserve.co.uk
Mon Jan 16 22:07:59 CET 2012


Are there any lemmatized corpora of German which can be queried on-line or on Windows? I'm trying to lemmatize some German text myself for lexical purposes, and I would like to see how others have handled the problems, and how well it works.

Of the German corpora I have found, Negra is POS-tagged but not lemmatized, while Tiger is both POS-tagged and lemmatized. Negra does not mention any query facility; Tiger had one which is no longer supported and unfortunately doesn't work for me. A problem for me with both these corpora is that the tagset they use (STTS) seems to be designed with syntax in mind. Here are some examples where this may not suit my lexical purposes.

1. The various finite forms of a verb (eg. aufsteigen) are lemmatized to the infinitive and tagged VVFIN, whereas the abstract noun (das Aufsteigen) is tagged NN. I think I would like to be able to retrieve them all together, eg. in response to "aufsteigen".

2. Present participles and past participles are tagged as adjectives (ADJA or ADJD). I think I would like to retrieve these too from the verbal infinitive.

3. Substantivised adjectives are tagged as nouns (eg. etwas Ähnliches). I think I would like these retrieved along with the forms of the adjective (ähnlich).

4. Separable verbs are tagged as two words when separated and as one word when not separated. I think I would like to retrieve separated and nonseparated examples together, though I have not decided whether this is best done by tagging them all as one word or as two.

5. Compound forms are not decompounded. I think I would like to decompound (most of) them.
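
To make points 1 and 4 concrete, here is a rough sketch of the kind of post-processing I have in mind, assuming input in TreeTagger's three-column layout (token, STTS tag, lemma). The sample data, the function names and the attachment rule (join a PTKVZ particle to the nearest preceding finite verb in the same sentence) are my own illustrative assumptions, not anything provided by TreeTagger or the corpora, and the rule will certainly miss some cases.

```python
from collections import defaultdict

# Hypothetical sample in three-column form: (token, STTS tag, lemma).
# The lemmas are written by hand for illustration, not real tagger output.
SAMPLE = [
    ("Er", "PPER", "er"),
    ("steigt", "VVFIN", "steigen"),
    ("schnell", "ADJD", "schnell"),
    ("auf", "PTKVZ", "auf"),
    (".", "$.", "."),
    ("Das", "ART", "die"),
    ("Aufsteigen", "NN", "Aufsteigen"),
    ("dauert", "VVFIN", "dauern"),
    ("lange", "ADV", "lange"),
    (".", "$.", "."),
]

# STTS tags for finite and imperative full verbs, the forms that can
# leave their particle behind.
FINITE_VERB_TAGS = {"VVFIN", "VVIMP"}


def rejoin_separable(tokens):
    """Return one lexical key per token (None for an absorbed particle).

    Assumed heuristic: a separated particle (PTKVZ) is prefixed to the
    lemma of the nearest preceding finite verb in the same sentence.
    """
    keys = [lemma for _, _, lemma in tokens]
    last_verb = None
    for i, (_, tag, lemma) in enumerate(tokens):
        if tag == "$.":                  # sentence boundary: forget the verb
            last_verb = None
        elif tag in FINITE_VERB_TAGS:
            last_verb = i
        elif tag == "PTKVZ" and last_verb is not None:
            keys[last_verb] = lemma + keys[last_verb]   # auf + steigen
            keys[i] = None               # the particle is no longer its own entry
            last_verb = None
    return keys


def build_index(tokens):
    """Group tokens under a lower-cased key, so that the nominalised
    infinitive (NN) lands beside the verb forms."""
    index = defaultdict(list)
    for (word, tag, _), key in zip(tokens, rejoin_separable(tokens)):
        if key is None or tag.startswith("$"):   # skip particles and punctuation
            continue
        index[key.lower()].append((word, tag))
    return dict(index)


if __name__ == "__main__":
    print(build_index(SAMPLE)["aufsteigen"])
```

Lower-casing the key is what brings "das Aufsteigen" together with the verb forms, but it also merges pairs which ought to stay apart, which is why the tag is stored with each hit.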

Although my interest is in lemmas, it is sometimes useful for me to have POS-tags also, eg. to distinguish arm-ADJ from Arm-NN.
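
As a small illustration of that, a lemma could be combined with a coarsened tag to form the retrieval key. The coarse classes and the function names below are only my own assumptions for the sketch; they are not part of STTS.

```python
def coarse_pos(stts_tag):
    """Collapse an STTS tag into a coarse class (illustrative mapping only)."""
    if stts_tag in ("ADJA", "ADJD"):     # attributive / predicative adjective
        return "ADJ"
    if stts_tag.startswith("V"):         # VVFIN, VVINF, VVPP, VAFIN, VMFIN, ...
        return "V"
    return stts_tag                      # leave everything else unchanged


def lexical_key(lemma, stts_tag):
    """Build a key such as 'arm-ADJ' from a lemma and its STTS tag."""
    return f"{lemma.lower()}-{coarse_pos(stts_tag)}"


if __name__ == "__main__":
    print(lexical_key("arm", "ADJD"))
    print(lexical_key("Arm", "NN"))
```

With keys of this kind the lemma can be lower-cased for retrieval without confusing the adjective with the noun.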

I have run my text through TreeTagger, using the training data for STTS, and expect to have to make the above changes manually. Before committing myself further, I'd like to try out anything which already exists, or to receive any advice.

Many thanks,
Ciarán Ó Duibhín.


