[Corpora-List] Q: Hyphenation removal

Roland Schäfer roland.schaefer at fu-berlin.de
Thu Aug 16 13:37:53 CEST 2012

Dear list members,

are there any tools to remove hard-coded "hyphe- nation" from texts (or papers describing principled solutions to the problem)? The tool/solution should ideally:

* work even if the line break after the hyphen has been removed,

* differentiate with near-perfect accuracy between actual hyphenation and other superficially identical graphematic constructions involving hyphens (like German truncated compound coordination as in "Bus- und Bahnticket" for "Busticket und Bahnticket"),

* (consequently:) be trainable (ideally in an unsupervised way) on arbitrary languages which use some specifiable set of characters to indicate hyphenation,

* possibly also detect cases of hyphenation which are written without a space (as in "hyphe-nation") or with additional spaces (as in "hyphe - nation"),

* process UTF-8 or at least some Unicode encoding,

* be open-source/patent-free and available in the form of a library or command line tool (e.g., not a GUI tool/part of some OCR product/web service).

Of course, references which only partially match these requirements are also highly appreciated.
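
To make the requirements above more concrete, here is a minimal sketch of the kind of naive baseline I have in mind. It is not an existing tool, and all names (HYPHENS, build_vocab, dehyphenate) are hypothetical. It assumes that a word-frequency list can be built from the corpus itself, and it joins "left- right" only if the joined form is attested as a word elsewhere. It covers only the case of a hyphen followed by whitespace (including a line break), not the "hyphe-nation" or "hyphe - nation" variants.

```python
import re
from collections import Counter

# Assumption: the set of hyphenation characters is specified by the user.
# Here: hyphen-minus, soft hyphen (U+00AD), Unicode hyphen (U+2010).
HYPHENS = "-\u00ad\u2010"

# A "word" is a run of Unicode letters (no digits, no underscore).
WORD = r"[^\W\d_]+"


def build_vocab(text):
    """Count lower-cased word forms found in the given text."""
    return Counter(w.lower() for w in re.findall(WORD, text))


def dehyphenate(text, vocab, hyphens=HYPHENS):
    """Join 'left- right' only if 'leftright' is attested in vocab.

    Candidates whose joined form is unattested (e.g. 'Bus- und')
    are left untouched.
    """
    pattern = re.compile(r"(%s)[%s]\s+(%s)" % (WORD, re.escape(hyphens), WORD))

    def join(match):
        joined = match.group(1) + match.group(2)
        # Counter returns 0 for unseen keys, so unattested forms are kept.
        return joined if vocab[joined.lower()] > 0 else match.group(0)

    return pattern.sub(join, text)


vocab = build_vocab("hyphenation is common. Busticket und Bahnticket.")
print(dehyphenate("hard-coded hyphe- nation in Bus- und Bahnticket", vocab))
```

Such a lookup obviously fails for unseen words and for joined forms which happen to exist by accident, which is why I am asking for something more principled.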

Thanks a lot.

Regards Roland
