[Corpora-List] fast string replacement

Paul Bijnens paul.bijnens at xplanation.com
Tue Mar 15 09:56:00 CET 2005


Jörg Schuster wrote:


> I mean really REALLY fast. The size of my rewriting dictionary is 1
> million lines at the moment. (But it will grow larger). The size of my
> corpus is 80GB. And I would like to be able to tag often.


Attached you'll find a little C program that replaces fixed strings.
I wrote it about 15 years ago and still use it.

[ attachment: http://torvald.aksis.uib.no/corpora/repl.zip ]

I've never tried it on a replacement set of 1 million lines,
but I'm very interested to see how it behaves on such large input. :-)

There is no man page, but the source contains some more information.

Quick start:

Make a file with the following syntax:

========== cut here ===========
# This is a comment
/search/replace/

# when several search strings match, the longest one is replaced
/searchsomethingelse/replace this too/

# blank lines are ignored

# The first non-alphabetic char is the separator:
!/this/contains/slashes!/THIS/CONTAINS/SLASHES/!

# A search or replacement string can contain newlines
# or any bytes (including null; better to encode it as \000)
/some
line/some line/

/need to split/need
to split/

# You can encode bytes with backslash notation such as
# \n, \t, etc., \007 (octal) or \xC4 (hexadecimal)
/élève/\xe9l\xe8ve/
========== cut here ===========
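
To make the table syntax concrete, here is a minimal sketch in C of how a single entry could be split and decoded. It is not taken from the repl source: the names parse_rule and decode_field are hypothetical, and treating a backslash followed by any other character (including the separator) as that literal character is an assumption on my part. The whole entry is passed as one NUL-terminated string, which may contain newlines.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Decode one field of an entry, stopping at an unescaped separator `sep`.
 * Decoded bytes go to `out` (capacity `cap`); their count is stored in *len.
 * Returns a pointer just past the closing separator, or NULL on error. */
static const char *decode_field(const char *p, char sep,
                                unsigned char *out, size_t cap, size_t *len)
{
    size_t n = 0;
    while (*p != '\0' && *p != sep) {
        unsigned int c = (unsigned char)*p++;
        if (c == '\\' && *p != '\0') {
            char e = *p++;
            switch (e) {
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            case 'r': c = '\r'; break;
            case 'x':                       /* \xHH: up to two hex digits */
                c = 0;
                for (int i = 0; i < 2 && isxdigit((unsigned char)*p); i++, p++) {
                    int d = isdigit((unsigned char)*p)
                          ? *p - '0'
                          : tolower((unsigned char)*p) - 'a' + 10;
                    c = c * 16 + (unsigned int)d;
                }
                break;
            default:
                if (e >= '0' && e <= '7') { /* \NNN: up to three octal digits */
                    c = (unsigned int)(e - '0');
                    for (int i = 0; i < 2 && *p >= '0' && *p <= '7'; i++, p++)
                        c = c * 8 + (unsigned int)(*p - '0');
                    c &= 0xFF;
                } else {
                    /* Assumption: any other escaped char stands for itself. */
                    c = (unsigned char)e;
                }
            }
        }
        if (n >= cap)
            return NULL;                    /* output buffer too small */
        out[n++] = (unsigned char)c;
    }
    if (*p != sep)
        return NULL;                        /* missing closing separator */
    *len = n;
    return p + 1;
}

/* Split "<sep>search<sep>replace<sep>" into two decoded byte strings.
 * The first character of the entry is taken as the separator.
 * Returns 0 on success, -1 on a malformed entry. */
int parse_rule(const char *rule,
               unsigned char *search, size_t scap, size_t *slen,
               unsigned char *repl, size_t rcap, size_t *rlen)
{
    assert(rule != NULL);
    if (rule[0] == '\0')
        return -1;
    char sep = rule[0];
    const char *p = decode_field(rule + 1, sep, search, scap, slen);
    if (p == NULL)
        return -1;
    p = decode_field(p, sep, repl, rcap, rlen);
    return p == NULL ? -1 : 0;
}
```

Because the decoded fields carry an explicit length, an entry such as /\000/x/ yields a one-byte search string containing a null byte, which a plain C string could not represent.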


Execute with:

$ repl /name/of/repl/table infile > outfile

You can also specify replacements on the command line:

$ repl -e '/\r\n/\n/' infile > outfile


At least the program is very simple... (and fast for me!)

If really needed, the tree implementation could be replaced
by a trie implementation to make it even faster, at the expense of
being more complicated (that's probably what the commercial progs do).
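
For illustration, a trie-based longest-match replacement could look roughly like the sketch below. This is not the code in repl.zip; the names trie_node, trie_add and trie_replace are hypothetical. To keep it short it works on NUL-terminated strings in memory (so, unlike the table format above, it cannot handle embedded null bytes), uses a full 256-entry child array per node, and never frees the trie.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One trie node: 256 child slots, so each input byte costs one array lookup. */
struct trie_node {
    struct trie_node *child[256];
    const char *repl;   /* replacement if a search string ends here, else NULL */
};

static struct trie_node *trie_new(void)
{
    struct trie_node *n = calloc(1, sizeof *n);
    assert(n != NULL);  /* a sketch: no graceful out-of-memory handling */
    return n;
}

/* Register one search/replace pair. `repl` must outlive the trie. */
void trie_add(struct trie_node *root, const char *search, const char *repl)
{
    struct trie_node *n = root;
    for (const unsigned char *p = (const unsigned char *)search; *p; p++) {
        if (n->child[*p] == NULL)
            n->child[*p] = trie_new();
        n = n->child[*p];
    }
    n->repl = repl;
}

/* Copy `in` to `out` (capacity `cap`), replacing the longest search string
 * that matches at each position. Returns 0 on success, -1 if `out` is too small. */
int trie_replace(const struct trie_node *root, const char *in,
                 char *out, size_t cap)
{
    size_t o = 0;
    if (cap == 0)
        return -1;
    while (*in) {
        const struct trie_node *n = root;
        const char *best = NULL;
        size_t best_len = 0;
        /* Walk down the trie, remembering the deepest complete match. */
        for (size_t i = 0; in[i] != '\0'; i++) {
            n = n->child[(unsigned char)in[i]];
            if (n == NULL)
                break;
            if (n->repl != NULL) {
                best = n->repl;
                best_len = i + 1;
            }
        }
        if (best != NULL) {
            size_t rl = strlen(best);
            if (o + rl >= cap)
                return -1;
            memcpy(out + o, best, rl);
            o += rl;
            in += best_len;
        } else {
            if (o + 1 >= cap)
                return -1;
            out[o++] = *in++;   /* no match here: copy one byte through */
        }
    }
    out[o] = '\0';
    return 0;
}
```

With the pairs search/replace and searchsomethingelse/replace this too loaded, the input "a search, searchsomethingelse" comes out as "a replace, replace this too". The 256-pointer node is the simplest layout but costs about 2 KB per node on a 64-bit machine, so a table of a million entries would probably need a more compact node representation.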


--
Paul Bijnens, Xplanation Tel +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM Fax +32 16 397.512
http://www.xplanation.com/ email: Paul.Bijnens at xplanation.com
***********************************************************************
* I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
* quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, *
* kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ... "Are you sure?" ... YES ... Phew ... I'm out *
***********************************************************************





