[Corpora-List] Java-based chunk parser?

Sérgio Matos aleixomatos at ua.pt
Thu Nov 26 11:20:18 CET 2009


Hi,

Maybe the UIMA Regular Expression Annotator is a good option for this:
http://incubator.apache.org/uima/sandbox.html#regex.annotator

Sérgio

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Edward Ivanovic
Sent: 26 November 2009 04:55
To: corpora at uib.no
Subject: [Corpora-List] Java-based chunk parser?

Dear Colleagues,

I'm looking for a Java-based tool that will let me define a simple grammar based on regular expressions to parse a given string. For example:

"2q34w-6q8w 5q8w-11q87w" (etc)

Could be parsed by the following rules:

A: \d+q
B: \d+w
C: <A>\-<B>
D: (<C> )+

mixing regex with my own labels (A,B,C,D). The actual syntax for the rules isn't important.

Parsing D will then give me the groupings for C (of which there will be two), and access to the other labels.

Something like the RegexpChunkParser in NLTK does this very well, but I can't use Python for this (needs to be Java), so was hoping someone would know of something before I write my own.

Many thanks,
Edward
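
For reference, the kind of labelled-regex grammar described in the question can be approximated with plain java.util.regex by textually expanding each <LABEL> reference into the regex of an earlier rule. The sketch below is only an illustration under stated assumptions: the class and method names (LabelledRegexGrammar, rule, matches, findAll) are hypothetical and not taken from any library, and the rules are adjusted slightly so that they accept the sample string. Each side of the hyphen in the sample carries both a "q" and a "w" token, so C is assumed to be <A><B>-<A><B>, and D is assumed to be <C>( <C>)* so that no trailing space is required.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any existing library.
class LabelledRegexGrammar {
    // Matches references such as <A>. Caveat: this would also catch Java
    // named-group syntax like (?<name>...), so rule bodies should avoid it.
    private static final Pattern REF = Pattern.compile("<(\\w+)>");

    // label -> fully expanded regex, in definition order
    private final Map<String, String> expanded = new LinkedHashMap<>();

    /** Adds a rule; every <X> in the body must name a rule added earlier. */
    LabelledRegexGrammar rule(String label, String body) {
        Matcher m = REF.matcher(body);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String sub = expanded.get(m.group(1));
            if (sub == null) {
                throw new IllegalArgumentException("Undefined label: " + m.group(1));
            }
            // Copy the text before the reference, then the referenced regex
            // wrapped in a non-capturing group.
            sb.append(body, last, m.start()).append("(?:").append(sub).append(")");
            last = m.end();
        }
        sb.append(body.substring(last));
        expanded.put(label, sb.toString());
        return this;
    }

    private Pattern pattern(String label) {
        String regex = expanded.get(label);
        if (regex == null) {
            throw new IllegalArgumentException("Undefined label: " + label);
        }
        return Pattern.compile(regex);
    }

    /** True if the whole input is matched by the rule with this label. */
    boolean matches(String label, String input) {
        return pattern(label).matcher(input).matches();
    }

    /** Returns every non-overlapping substring matched by the rule, left to right. */
    List<String> findAll(String label, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern(label).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        LabelledRegexGrammar g = new LabelledRegexGrammar()
                .rule("A", "\\d+q")
                .rule("B", "\\d+w")
                .rule("C", "<A><B>-<A><B>")   // assumption: adjusted to fit the sample
                .rule("D", "<C>( <C>)*");     // assumption: no trailing space needed
        String input = "2q34w-6q8w 5q8w-11q87w";
        if (g.matches("D", input)) {
            for (String c : g.findAll("C", input)) {
                System.out.println("C=" + c
                        + " A=" + g.findAll("A", c)
                        + " B=" + g.findAll("B", c));
            }
        }
    }
}
```

Validating the whole string against D and then re-scanning for C, and within each C for A and B, gives the two C groupings and access to the inner labels. It does not build a real parse tree, so a dedicated chunking tool would still be preferable for anything beyond small, non-recursive grammars like this one.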

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


