[Corpora-List] trying to parse crossing dependencies

Yannick Versley versley at sfs.uni-tuebingen.de
Mon Oct 6 10:28:33 CEST 2008


> I always wondered about how do you parse trennbare Verben in German,
> which to me, even if not the same kind of problem, does relate to
> "crossing dependencies" somehow
Parsing separable verbs, but also topicalized objects in German, or relative clauses in most languages, is usually difficult and people have come up with different solutions which all somehow work but don't really look completely satisfying. * You can just ignore the problem by making everything projective and sort out everything after your parser has done the grunt work. This is the common approach when people do PCFG parsing on e.g. the Tiger treebank - they just reattach the offending nodes to a higher level and pretend nothing has happened. The MALT dependency parser (Nivre&Nilsson 2005) reattaches edges and puts different labels on them so a post-processing step can reattach things to where they came from. This is more honest than the former since, unlike using PARSEVAL metrics on the transformed structures, you can actually compare the parse+postprocessing result directly to what you aimed at. * There is the biting-the-bullet approach of just doing non-projective parsing by either solviong a harder problem or doing a coarser approximation. The WCDG approach (Menzel&Schröder 1998, Foth/Daum/Menzel 2004) simply puts nonprojective parsing into a weighted constraint satisfaction problem, where you can posit arbitrary constraints on parse trees. This usually means that efficiently solving this problem requires a considerable engineering effort. A method using a more general-purpose constraint server can be found in (Riedel&Clarke, 2006) A slightly less ambitious version of this would be (McDonald&Pereira 2006), who first use a projective parsing algorithm to find a reasonable parse and then use hill-climbing to re-attach non-projective edges. And there's also an approach of using an O(n^2) spanning tree algorithm to directly get a non-projective dependency tree, as in (McDonald&Pereira 2005). The obvious problem of this approach is that you cannot represent dependencies between edges; so if you want to keep the algorithm not to attach two determiners or two subject to one word, you have to do this with approximate, indirect means (aka "some fumbling required"). There are people who try to base their formalism on parsing algorithms that cover *exactly* the class of mildly-context-sensitive grammars, which are parsable in O(n^6) worst case time [which is still dreadful, but sounds much nicer than the exponential worst-case in the case of general constraint satisfaction]. See e.g. (Lichte&Kallmeyer 2008) for one approach in German.

For handling separable verbs, there are different approaches: * one would be to get an approximate idea of clause structure (through clause chunking) and gather the parts of the separable verbs to put them together and look them up. This is quite obviously a 90% solution, but it works well enough in cases where you want the verb lemma, but can't be bothered to do full parsing. * In the WCDG approach, you *can* have a constraint that looks at the verb prefix and does the right thing. For practical reasons (as far as I know), the handling of verb valencies is implemented approximately as a kind of "this verb can have a dative object when it occurs without a prefix, and an accusative and/or a dative object"). * In HPSG implementations (e.g. Müller 2002 http://hpsg.fu-berlin.de/~stefan/Pub/paradox.html for an overview) usually treat the separable prefix as an argument to the verb, which means that there are lots of different lexicon entries for one surface verb form (i.e., "hat" has lexicon entries for vor#haben, an#haben, auf#haben, etc.)

Separable prefixes are a different class of problems than the worst non-projectivities (the ones which make the mild context sensitivity necessary), however, since they can be reasonably treated with a (non-Xbar) PCFG grammar.

Best wishes, Yannick Versley -- Yannick Versley Seminar für Sprachwissenschaft, Abt. Computerlinguistik Wilhelmstr. 19, 72074 Tübingen Tel.: (07071) 29 77352

More information about the Corpora mailing list