[Corpora-List] Lemmatizing German text for lexical purposes

Helmut Schmid schmid at ims.uni-stuttgart.de
Tue Jan 17 10:29:09 CET 2012


Dear Ciarán,

As Heike already said, SMOR might be of interest to you. It should be able to handle most of the problems you mentioned. Here are some examples:


> Aufsteigen
auf<VPART>steigen<V><SUFF><+NN><Neut><Nom><Sg>
auf<VPART>steigen<V><SUFF><+NN><Neut><Dat><Sg>
auf<VPART>steigen<V><SUFF><+NN><Neut><Acc><Sg>
// nominalisation of a particle verb


> verkleinertes
verkleinern<V><PPast><SUFF><+ADJ><Pos><Neut><Nom><Sg><St>
verkleinern<V><PPast><SUFF><+ADJ><Pos><Neut><Acc><Sg><St>
// adjectivisation of a past participle


> Ähnliches
ähnlich<ADJ><SUFF><+NN><Neut><Nom><Sg><St>
ähnlich<ADJ><SUFF><+NN><Neut><Acc><Sg><St>
// nominalisation of an adjective


> Morphologiesysteme
Morphologie<NN>System<+NN><Neut><Dat><Sg><Old>
Morphologie<NN>System<+NN><Neut><Nom><Pl>
Morphologie<NN>System<+NN><Neut><Gen><Pl>
Morphologie<NN>System<+NN><Neut><Acc><Pl>
// compound
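
For lexical retrieval, analysis strings like those above can be reduced to a base lemma plus the word class of the surface form. The following is only a minimal sketch: it assumes the tag layout seen in the examples in this message (stems followed by tags in angle brackets, the surface word class marked with a leading "+", verb particles marked <VPART>). The function name parse_analysis is hypothetical and not part of SMOR, and real SMOR output may contain markers this sketch does not handle.

```python
import re

# A token is either a tag in angle brackets or a run of plain stem text.
TOKEN = re.compile(r"<([^<>]+)>|([^<>]+)")

def parse_analysis(analysis):
    """Split one analysis string into lexical parts and the surface word class.

    Hypothetical helper; it relies only on the format of the examples above.
    """
    segments = []  # each entry: [stem, list of tags that follow the stem]
    for match in TOKEN.finditer(analysis.strip()):
        tag, text = match.groups()
        if text is not None:
            segments.append([text, []])
        elif segments:
            segments[-1][1].append(tag)

    parts = []          # (lemma, word class) per lexical part
    prefix = ""         # verb particle(s) waiting to be attached
    surface_pos = None  # class of the whole word form (the tag with "+")
    for stem, tags in segments:
        for tag in tags:
            if tag.startswith("+"):
                surface_pos = tag[1:]
        if tags and tags[0] == "VPART":
            prefix += stem  # attach the particle to the following verb
            continue
        category = tags[0].lstrip("+") if tags else None
        parts.append((prefix + stem, category))
        prefix = ""
    return {"parts": parts, "surface_pos": surface_pos}

result = parse_analysis("auf<VPART>steigen<V><SUFF><+NN><Neut><Nom><Sg>")
print(result["parts"], result["surface_pos"])
```

With this reduction, "Aufsteigen" can be indexed under the verb aufsteigen even though its surface class is NN, and "Morphologiesysteme" yields the two parts Morphologie and System.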

You could even approach the separable verb prefix problem by attaching the separated prefix to the verb and analysing it. Take the sentence "Er schlägt das Buch auf". You extract "schlägt" and "auf" and analyse the recombined wordform:
> aufschlägt
auf<VPART>schlagen<+V><3><Sg><Pres><Ind>
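
The recombination step could be sketched as follows, assuming the text has already been POS-tagged with STTS (for example by TreeTagger), where VVFIN marks a finite full verb and PTKVZ a separated verb particle. The function name recombine_candidates is hypothetical, and the pairing heuristic (attach each particle to the nearest preceding finite verb in the same sentence) is a simplifying assumption that will fail on more complex sentences. The recombined wordform is only a candidate and would still have to be passed to the morphological analyser for confirmation.

```python
def recombine_candidates(tagged_sentence):
    """Return (verb_index, particle_index, recombined_form) triples.

    tagged_sentence is a list of (word, STTS tag) pairs for one sentence.
    Each separated particle (PTKVZ) is paired with the nearest preceding
    finite full verb (VVFIN); this is a heuristic, not a parser.
    """
    candidates = []
    last_finite = None  # index of the most recent unpaired finite verb
    for index, (word, tag) in enumerate(tagged_sentence):
        if tag == "VVFIN":
            last_finite = index
        elif tag == "PTKVZ" and last_finite is not None:
            verb = tagged_sentence[last_finite][0]
            # Lowercase both parts in case the verb is sentence-initial.
            candidates.append((last_finite, index, word.lower() + verb.lower()))
            last_finite = None
    return candidates

sentence = [("Er", "PPER"), ("schlägt", "VVFIN"), ("das", "ART"),
            ("Buch", "NN"), ("auf", "PTKVZ")]
print(recombine_candidates(sentence))
```

For the example sentence this yields the candidate "aufschlägt", together with the token positions of the verb and the particle so that both can be linked to the same lemma.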

SMOR is not freely available yet, but you can obtain a free research license.

Best regards,

Helmut Schmid

Am 16.01.2012 22:07, schrieb Ciarán Ó Duibhín:
> Are there any lemmatized corpora of German which can be queried
> on-line or on Windows? I'm trying to lemmatize some German text
> myself for lexical purposes, and I would like to see how others have
> handled the problems, and how well it works.
>
> Of the German corpora I have found, Negra is POS-tagged but not
> lemmatized, while Tiger is both POS-tagged and lemmatized. Negra does
> not mention any query facility; Tiger had one which is no longer
> supported and unfortunately doesn't work for me. A problem for me
> with both these corpora is that the tagset they use (STTS) seems to
> be designed with syntax in mind. Here are some examples where this
> may not suit my lexical purposes.
>
> 1. The various finite forms of a verb (eg. aufsteigen) are lemmatized
> to the infinitive and tagged VVFIN, whereas the abstract noun (das
> Aufsteigen) is tagged NN. I think I would like to be able to retrieve
> them all together, eg. in response to "aufsteigen".
>
> 2. Present participles and past participles are tagged as adjectives
> (ADJA or ADJD). I think I would like to retrieve these too from the
> verbal infinitive.
>
> 3. Substantivised adjectives are tagged as nouns (eg etwas
> Ähnliches). I think I would like these retrieved along with the forms
> of the adjective (ähnlich).
>
> 4. Separable verbs are tagged as two words when separated and as one
> word when not separated. I think I would like to retrieve separated
> and nonseparated examples together, though I have not decided whether
> this is best done by tagging them all as one word or as two.
>
> 5. Compound forms are not decompounded. I think I would like to
> decompound (most of) them.
>
> Although my interest is in lemmas, it is sometimes useful for me to
> have POS-tags also, eg. to distinguish arm-ADJ from Arm-NN.
>
> I have run my text through TreeTagger, using the training data for
> STTS, and expect to have to make the above changes manually. Before
> committing myself further, I'd like to try out anything which already
> exists, or to receive any advice.
>
> Many thanks,
> Ciarán Ó Duibhín.
