[Corpora-List] fast string replacement

Paul Bijnens paul.bijnens at xplanation.com
Tue Mar 15 09:56:00 CET 2005

Jörg Schuster wrote:

> I mean really REALLY fast. The size of my rewriting dictionary is 1
> million lines at the moment. (But it will grow larger). The size of my
> corpus is 80GB. And I would like to be able to tag often.

Attached you'll find a little C program that replaces fixed strings.
I wrote it about 15 years ago, and I'm still using it.

[ attachment: http://torvald.aksis.uib.no/corpora/repl.zip ]

I've never tried it on a replacement set of 1 million lines,
but I'm very interested to see how it behaves on such large input. :-)

There is no man page, but the source contains some more information.

Quick start:

Make a file with the following syntax:

========== cut here ===========
# This is a comment

# the longest search string will be replaced
/searchsomethingelse/replace this too/

# blank lines are ignored

# The first non-alphabetic char is the separator:

# A search or replacement string can contain newlines
# or any bytes (including null, which is better encoded as \000)
line/some line/

/need to split/need
to split/

# You can encode bytes with backslash notation like
# \n, \t, ...etc, \007 (octal) or \xC4 (hexadecimal)
========== cut here ===========
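
To make the backslash notation concrete, here is a small Python sketch of
how such escapes could be decoded into raw bytes. It is not taken from the
repl source: the function name decode_escapes and the exact set of accepted
escapes (\n, \t, \r, up to three octal digits, \x with up to two hex digits)
are assumptions for illustration, and the real program may differ.

```python
import re

# Assumption: one backslash followed by \xHH, up to three octal digits,
# or a single other byte. The real repl may accept a different set.
_ESCAPE = re.compile(rb"\\(x[0-9A-Fa-f]{1,2}|[0-7]{1,3}|.)", re.DOTALL)
_NAMED = {b"n": b"\n", b"t": b"\t", b"r": b"\r"}


def decode_escapes(raw: bytes) -> bytes:
    """Expand backslash escapes in a search or replacement string."""

    def expand(match):
        body = match.group(1)
        if body[:1] == b"x" and len(body) > 1:
            # Hexadecimal escape such as \xC4
            return bytes([int(body[1:], 16)])
        if body[:1] in b"01234567":
            # Octal escape such as \007 or \000, clamped to one byte
            return bytes([int(body, 8) & 0xFF])
        # Named escape, or any other byte taken literally (e.g. \\ or \/)
        return _NAMED.get(body, body)

    return _ESCAPE.sub(expand, raw)
```

With this, decode_escapes(rb"\r\n") gives the two bytes CR LF, and a null
byte can be written safely as \000.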

Execute with:

$ repl /name/of/repl/table infile > outfile

You can also specify replacements on the command line:

$ repl -e '/\r\n/\n/' infile > outfile

At least the program is very simple... (and fast for me!)

If really needed, the tree implementation could be replaced
by a trie implementation to make it even faster, at the expense of
being more complicated (that's probably what the commercial progs do).
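
For readers who want to see what the trie variant might look like, here is
a minimal Python sketch of longest-match replacement over a byte trie. It
is not the code in repl.zip (that is C and uses a tree); the names
build_trie and replace_all are made up for this example, and it holds the
whole input in memory, which would not do for an 80GB corpus without
chunking.

```python
def build_trie(rules):
    """Build a nested-dict trie from a {search: replacement} bytes mapping."""
    root = {}
    for search, replacement in rules.items():
        if not search:
            continue  # an empty search string would match at every position
        node = root
        for byte in search:
            node = node.setdefault(byte, {})
        node[None] = replacement  # the None key marks the end of a search string
    return root


def replace_all(data, root):
    """Scan data once; at each position apply the longest matching rule."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        node, j = root, i
        best_end, best_repl = -1, None
        # Walk the trie as far as the input allows, remembering the
        # longest complete search string seen along the way.
        while j < n and data[j] in node:
            node = node[data[j]]
            j += 1
            if None in node:
                best_end, best_repl = j, node[None]
        if best_end < 0:
            out.append(data[i])  # no rule starts here: copy one byte
            i += 1
        else:
            out += best_repl     # replacement text is not rescanned
            i = best_end
    return bytes(out)


rules = {
    b"search": b"replace",
    b"searchsomethingelse": b"replace this too",
    b"\r\n": b"\n",
}
trie = build_trie(rules)
result = replace_all(b"searchsomethingelse and search\r\n", trie)
```

The lookup cost per input position depends on the length of the longest
match, not on the number of rules, which is the attraction for a table of
a million lines. The price is memory: one dict per trie node is heavy in
Python, so a real implementation would want a more compact node layout.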

Paul Bijnens, Xplanation
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM Fax +32 16 397.512
http://www.xplanation.com/ email: Paul.Bijnens at xplanation.com
