[Corpora-List] Roget's Thesaurus as an Electronic Lexical Knowledge Base

Stan Szpakowicz szpak at site.uottawa.ca
Mon Jun 12 23:23:01 CEST 2006

Roget's Thesaurus as an Electronic Lexical Knowledge Base


Roget's Thesaurus in Java, designed for Natural Language Processing, is
now available for downloading. We distribute it under the GNU General
Public License. The system is the graduate work of Mario Jarmasz
<http://www.site.uottawa.ca/~mjarmasz/thesis/>, who implemented it with
the proprietary lexical data in the 1987 Penguin Roget's. Olena Medelyan
<http://www.cs.waikato.ac.nz/~olena/> has wonderfully reengineered
Mario's system with the public-domain 1911 Roget's.

The Roget's ELKB package includes four examples of NLP applications:
detecting lexical chains in text, determining semantic distance between
words and phrases, clustering words based on their meaning and solving a
word quiz.

If you decide to use the ELKB, please put a link to the download site on
your Web page. (See my home page for a nifty logo.)

[The system is perfectly functional, but the 1911 data are antiquated.
We are in discussion with Pearson Education, the owner of the 1987
Penguin Roget's, about the fee structure and distribution mode that
would enable the NLP community to acquire the much more attractive

Stan Szpakowicz, PhD, Professor  613-562-5800/6687           /~\  The ASCII Ribbon
SITE, Computer Science           szpak at site.uottawa.ca    \ /  Campaign Against
University of Ottawa             www.site.uottawa.ca/~szpak   X   HTML Email
