[Corpora-List] Summary: fast string replacement
joerg.schuster at gmail.com
Tue Mar 15 14:14:00 CET 2005
Thanks to all who participated in this discussion.
First I have to apologize for my original posting (or mail?): I asked
for programs for transducing strings. I wrote 'strings (!)' to
indicate that I really meant strings (and not regular expressions or
tokens). Yet the examples I gave misled some people, because they did
not include cases of transduction of multi-word lexemes.
In the remainder of this message I will give an overview of the
suggested solutions. The solution that I like best is Paul Bijnens' C
program, (12) below. For brevity, I will mostly omit the names of the
people who pointed me to the sites.
(1) Max Silberztein: http://www.nyu.edu/pages/linguistics/intex/
(2) Helmut Schmid: http://www.ims.uni-stuttgart.de/~schmid
(3) Stephan Kanthak:
(4) Gertjan van Noord: http://grid.let.rug.nl/~vannoord/Fsa/fsa.html
(5) Arnaud Adant: http://membres.lycos.fr/adant/tfe/
(6) ISI: http://www.isi.edu/licensed-sw/carmel/
(7) MIT: http://people.csail.mit.edu/people/ilh/fst/
Comments: (1)-(6) all look like really serious programs. Yet, I
considered them to be too complicated for my purposes.
(7) is not available at the moment.
(8) ?: ftp://ftp.gnu.org/non-gnu/flex/
Comment: good, but overkill for my purposes.
(9) Songlin Piao pointed me to a java tool of his:
Comment: I tried to use it, but it did not work:
$ java -jar mlct_concordance.jar
$ Invalid or corrupt jarfile mlct_concordance.jar
(10) Leif Arda Nielsen advised me to use sed.
Comment: too slow.
(11) Damon Allen Davison advised me to use SQL.
Comment: I did not quite understand Damon's mail.
(12) Paul Bijnens pointed me to a C program of his:
Comment: This program is great.
- It worked immediately. (No fumbling around with paths,
(versions of) compilers and the like.)
- It doesn't seem to care about the size of the rewrite
dictionary (except that you need to have enough RAM, of course).
- It is quite fast: I gave it a rewrite dictionary of 1 million
entries. It transduced about 50MB per minute on an Athlon 2600+.
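
To make concrete what "transducing strings" with a large rewrite
dictionary can look like, here is a minimal Python sketch. It is not
Paul Bijnens' program and makes no claim about how that program works.
It assumes a character trie with longest-match-first replacement, and
the names build_trie and rewrite are hypothetical. Matching works on
raw strings, so entries may contain spaces (multi-word lexemes) and may
match inside larger words.

```python
def build_trie(rules):
    """Build a character trie from a dict of source string -> replacement."""
    root = {}
    for source, target in rules.items():
        if not source:
            continue  # an empty source string would match at every position
        node = root
        for ch in source:
            node = node.setdefault(ch, {})
        node[None] = target  # the key None marks the end of an entry
    return root


def rewrite(text, trie):
    """Scan left to right, replacing the longest entry that starts at each position."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        node = trie
        j = i
        match_end = -1
        replacement = None
        # Walk the trie as far as the text allows, remembering the last full entry.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                match_end = j
                replacement = node[None]
        if match_end != -1:
            out.append(replacement)
            i = match_end
        else:
            out.append(text[i])  # no entry starts here: copy one character
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical toy dictionary; a real one would be read from a file.
    rules = {"New York": "NY", "New": "new", "cat": "dog"}
    print(rewrite("New York has a New cat", build_trie(rules)))
```

In this sketch, each lookup walks the trie from the current text
position, so the work per position depends on the length of the entries
rather than on how many entries the dictionary holds. The whole
dictionary has to fit in memory.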