[Corpora-List] (no subject)

Mike Maxwell
Thu Jan 17 01:58:06 CET 2013

On 1/15/2013 12:28 PM, Eirini LS wrote:
> I was a bit confused when a person who has created an analyzer
> (using Xerox calculus, lexc) argued that the module works only
> for analysis, but doesn't generate anything, and nobody can use
> it in the other direction (using lookup, recall). It is not right to
> read a list that it generates using a command "print lower-words".
> Is it right? How can I check the quality of an analyzer?

Since no one has responded to this, I'll try.

The Xerox Finite State Tools (both lexc and xfst) are inherently bidirectional; if you can analyze words, you can also generate from whatever underlying representation the writer of the parser code has chosen. That is, if 'cats' analyzes as 'cat+PL', then you can input 'cat+PL' in generate mode, and it will give you 'cats'.
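To make the bidirectionality concrete, here is a minimal sketch in Python. It is not the xfst or lexc API: it simply models a transducer as a finite set of upper/lower pairs, and the pairs, tags, and function names are all hypothetical, made up for illustration.

```python
from collections import defaultdict

# Hypothetical upper/lower pairs standing in for a compiled transducer.
PAIRS = [
    ("cat+SG", "cat"),
    ("cat+PL", "cats"),
    ("fly+N+PL", "flies"),
    ("fly+V+3SG", "flies"),
]


def build_indexes(pairs):
    """Index the same relation in both directions."""
    upper_to_lower = defaultdict(set)
    lower_to_upper = defaultdict(set)
    for upper, lower in pairs:
        upper_to_lower[upper].add(lower)
        lower_to_upper[lower].add(upper)
    return upper_to_lower, lower_to_upper


UPPER_TO_LOWER, LOWER_TO_UPPER = build_indexes(PAIRS)


def analyze(surface):
    """Surface form -> all underlying analyses (the 'up' direction)."""
    return sorted(LOWER_TO_UPPER.get(surface, ()))


def generate(underlying):
    """Underlying form -> all surface forms (the 'down' direction)."""
    return sorted(UPPER_TO_LOWER.get(underlying, ()))
```

In this toy model, analyze("cats") returns the list containing 'cat+PL', and generate("cat+PL") returns the list containing 'cats'. Both directions read the same set of pairs, which is the sense in which analysis and generation come as a package.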

What the person you talked to may have been referring to is that (if I'm remembering correctly) the standard version of lexc and xfst limits how many "words" it will print with "print words". (I didn't think there was a limit on "print lower-words", but I may be wrong.) As I understand it, Xerox was trying to protect its investment in the code that produces upper/lower pairs from a lexicon plus rules. Without such a limit, you could compile a transducer using lexc and/or xfst, dump the upper/lower pairs, and load those pairs into some simple-to-build, unlicensed FST runtime that had no compilation capability. There was a commercial version of the tools, which cost considerably more and which could be used to build commercial, distributable FSTs. But I am not a lawyer, and my memory of this is fuzzy. If you need more information, you should contact Lauri Karttunen and Ken Beesley, who wrote the book on xfst and lexc (literally and figuratively).

Also, there is now an open source tool, foma, which does most of what xfst did, with the exception of compile-replace (used for some kinds of reduplication); but I believe foma has a work-around for this. The compile-replace algorithm was patented.

Checking the quality of a morphological analyzer built with xfst/lexc (or any other such tool) is a different question. There are lots of ways to do it; one we used was to run test cases (words to be parsed) through xfst and hand-validate the output. The validated input/output pairs were stored in a version control system, so that later versions of the analyzer could be regression-tested against them. There are other ways as well.
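The regression-testing idea can be sketched as follows. This is only an illustration under stated assumptions: the analyzer is represented by a plain Python callable (how it would talk to xfst or another tool is deliberately left out), and the gold data, the toy analyzer, and all names are hypothetical.

```python
def run_regression(analyze, gold):
    """Compare analyzer output with hand-validated analyses.

    analyze: callable mapping a surface word to an iterable of analyses.
    gold:    dict mapping each test word to its validated analyses.
    Returns one record per word whose output differs from the gold set.
    """
    failures = []
    for word in sorted(gold):
        expected = set(gold[word])
        actual = set(analyze(word))
        if actual != expected:
            failures.append({
                "word": word,
                "missing": sorted(expected - actual),
                "unexpected": sorted(actual - expected),
            })
    return failures


# Hypothetical stand-in for the real analyzer: it lacks the verb reading.
def toy_analyze(word):
    table = {"cats": ["cat+PL"], "flies": ["fly+N+PL"]}
    return table.get(word, [])


# Hypothetical hand-validated pairs, as they might be kept under version control.
GOLD = {
    "cats": ["cat+PL"],
    "flies": ["fly+N+PL", "fly+V+3SG"],
}

if __name__ == "__main__":
    for failure in run_regression(toy_analyze, GOLD):
        print(failure)
```

Because the comparison is on sets of analyses per word, it catches both missing analyses and spurious ones, and an empty result means the analyzer still agrees with everything that was validated by hand.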

For the record, I would not use "print lower-words" for testing the parser, since that doesn't tell you whether you get the *correct* analysis.
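A small sketch of why a list of surface words is not enough, again modelling a transducer as a set of hypothetical upper/lower pairs rather than using any real xfst call: two analyzers can produce exactly the same lower-side word list while disagreeing on the analysis.

```python
# Two hypothetical analyzers as sets of (upper, lower) pairs.
CORRECT = {("cat+PL", "cats"), ("fly+V+3SG", "flies")}
BUGGY = {("cat+SG", "cats"), ("fly+V+3SG", "flies")}  # wrong tag on "cats"


def lower_words(pairs):
    """Toy analogue of listing the lower side: surface words only."""
    return sorted({lower for _upper, lower in pairs})


# The surface word lists are identical ...
same_surface = lower_words(CORRECT) == lower_words(BUGGY)

# ... but the relations differ, so the buggy analysis goes unnoticed.
same_relation = CORRECT == BUGGY
```

Here same_surface is True and same_relation is False, so only a check on the pairs themselves would expose the wrong tag.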

Mike Maxwell
maxwell at umiacs.umd.edu
"My definition of an interesting universe is
one that has the capacity to study itself."
--Stephen Eastmond
