[Corpora-List] fast string replacement

Anssi Yli-Jyra aylijyra at ling.Helsinki.FI
Fri Mar 11 21:30:01 CET 2005


On Fri, 11 Mar 2005 js at cis.uni-muenchen.de wrote:

> I am looking for a program that

> - takes as input a string (!) rewriting dictionary and a corpus

> - applies all rewriting rules to all lines of the corpus

> - is fast, stable and free

> - works under Linux


The fastest tool around is LEX, or its newer version FLEX, which is
available in all Linux distributions. It takes a list of patterns with
their associated print statements and compiles them into a C/C++ program
that does the rewriting between standard input and standard output. When
used carefully it can be almost as fast as the Unix word count program
(wc), so it is very fast.
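
As a rough illustration of the workflow above, the following Python sketch
turns a rewriting dictionary into a Flex specification with one literal
pattern and one print statement per entry. The helper names (c_quote,
make_flex_spec) and the sample entries are hypothetical, and the sketch
assumes the entries contain no newlines or other control characters. Text
that matches no rule is copied through unchanged by Flex's default rule.

```python
def c_quote(text):
    """Quote text as a double-quoted literal.

    The same escaping (backslash and double quote) is used for the
    Flex pattern and for the C string in the action.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def make_flex_spec(rules):
    """Return the text of a Flex specification for a dict of rewrites."""
    lines = ["%option noyywrap", "%%"]
    for source, target in rules.items():
        # One rule per entry: a literal pattern and an action printing
        # the replacement string to standard output.
        lines.append(f"{c_quote(source)}\tfputs({c_quote(target)}, stdout);")
    lines.append("%%")
    # Minimal driver: scan standard input until end of file.
    lines.append("int main(void) { yylex(); return 0; }")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample dictionary, for illustration only.
    print(make_flex_spec({"colour": "color", "centre": "center"}), end="")
```

If the output is saved as rules.l, something like
"flex rules.l && cc lex.yy.c -o rewrite" followed by
"./rewrite < corpus.txt" should build and run the filter.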

Lex looks for the leftmost longest match and then runs the action of the
matching rule, where you can print a replacement string. All the rules are
matched in parallel, but you can also define several "states" (start
conditions) that indicate which subset of the rules is in use.
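
To make the leftmost longest matching concrete, here is a small Python
sketch of the same strategy for plain strings. It is not how Lex is
implemented (Lex compiles the rules into an automaton); it walks a
character trie instead, and the names build_trie and rewrite are made up
for this example. As with Lex, the replacement text is not scanned again.

```python
def build_trie(rules):
    """Build a character trie; the key None in a node holds the replacement."""
    root = {}
    for source, target in rules.items():
        if not source:
            raise ValueError("empty source string")
        node = root
        for ch in source:
            node = node.setdefault(ch, {})
        node[None] = target
    return root


def rewrite(line, trie):
    """Apply all rules to one line, leftmost longest match first."""
    out = []
    i = 0
    n = len(line)
    while i < n:
        node = trie
        j = i
        best_end = -1
        best_target = None
        # Follow the trie as far as possible, remembering the longest
        # complete source string seen so far.
        while j < n and line[j] in node:
            node = node[line[j]]
            j += 1
            if None in node:
                best_end = j
                best_target = node[None]
        if best_end != -1:
            out.append(best_target)   # emit the replacement
            i = best_end              # continue after the matched text
        else:
            out.append(line[i])       # no rule starts here: copy one character
            i += 1
    return "".join(out)


if __name__ == "__main__":
    trie = build_trie({"ab": "X", "abc": "Y", "b": "Z"})
    print(rewrite("abcab b", trie))
```

With these three rules, "abc" wins over "ab" at the start of the line
because it is the longer match at the same position.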

I would say that the best tool for matching many (>500) strings and long
(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....) strings is
the Beta program, but I do not know how free it is. Lingsoft sells
commercial licenses. It is quite an old program, but it uses state machines
and packed transitions very efficiently and should be kept in mind when
considering such tools. I used it when Lex (or GNU Lex = Flex) could not
compile its rules into automata. Typically the limit of Flex is somewhere
around 500 rules, after which the machine grows too big.

If you want full transducers, try the RWTH FSA utilities. They are free and
very efficient.

-- A Yli-Jyrä





