[Corpora-List] Thesaurus recommendation.

Trevor Jenkins trevor.jenkins at suneidesis.com
Mon Apr 18 10:54:59 CEST 2011

On Mon, 18 Apr 2011, Hong-woo Chun <hongwoo.chun at gmail.com> wrote:

> I'm searching for thesauri.
> I would like to extract IS-A/PART-OF relations from the texts using BT-NT
> pairs in thesaurus. It's not depend upon any domain. Domain-Free!!!
> Currently, UMLS, Compendex have been manually categorized based on the
> corresponding relations.
> I'm trying to find out thesauri, but most of them are w.r.t. Biological or
> Biomedical domains.
> Are there good thesauri w.r.t. any Scientific domains?

There are a couple of thesauri compiled by the British Museum. Their Object Names and Materials thesauri are online at Collections Trust. However, they are difficult to locate; CT's intra-site search feature is borked. Check out http://www.collectionstrust.org.uk/bmobj/Objintro.html and http://www.collectionstrust.org.uk/bmmat/matintro.html respectively.

The format of these micro-sites is a little bizarre. I once had to convert their content from the peculiar HTML they use into a format suitable for loading into a text retrieval system as a thesaurus. However, some Perl/Python/Ruby coding should extract the terms for you. (I can't give you the code I wrote, as it was written for my employer in their time for their client.)
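As a rough illustration of the kind of extraction meant here, a minimal Python sketch follows. The markup it handles is an assumption made up for the example (head term in <b>, a bare "BT" or "NT" label, related terms as <a> links); the real pages use different HTML, so the tag tests would need adapting. The names ThesaurusPageParser, extract_bt_nt_pairs and SAMPLE_HTML are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this sketch; the real pages differ.
SAMPLE_HTML = (
    '<p><b>amphora</b><br>BT <a href="#vessel">vessel</a>'
    '<br>NT <a href="#neck-amphora">neck amphora</a></p>'
)


class ThesaurusPageParser(HTMLParser):
    """Collect (narrower term, broader term) pairs from the assumed layout."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # list of (narrower, broader) tuples
        self._head = None      # entry term currently being read
        self._in_head = False  # inside <b>...</b>
        self._in_link = False  # inside <a>...</a>
        self._relation = None  # "BT", "NT" or None

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_head = True
            self._relation = None  # a new entry resets the relation label
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_head = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_head:
            self._head = text
        elif self._in_link:
            if self._head and self._relation == "BT":
                # The linked term is broader than the head term.
                self.pairs.append((self._head, text))
            elif self._head and self._relation == "NT":
                # The linked term is narrower than the head term.
                self.pairs.append((text, self._head))
        else:
            # Plain text between tags: keep only BT/NT labels.
            self._relation = text if text in ("BT", "NT") else None


def extract_bt_nt_pairs(html):
    """Return the (narrower, broader) pairs found in one page of HTML."""
    parser = ThesaurusPageParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Calling extract_bt_nt_pairs(SAMPLE_HTML) gives one pair per BT or NT link, each ordered as (narrower, broader), so the pairs can be read directly as candidate IS-A relations.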

> Please recommend good thesauri.

Because the Collections Trust web site search feature is broken, you might wish to do a site-specific search in Google

"site:www.collectionstrust.org.uk thesaurus"

which could give you upwards of 1,000 further links and thesauri.

There is a Social History and Industrial Classification (SHIC) thesaurus that was developed in the 1980s by curators from several other major UK museums. I've only ever seen this in a printed edition, never online. There was some talk of an update, SHIC-2, but the project may not have been started.

There is also MeSH (Medical Subject Headings) from NIH. Again I had to process this back in the days when it was provided on mag tape in variable/fixed-length blocks in US/UK MARC format. I believe that it is now available in XML format. Check out http://www.ncbi.nlm.nih.gov/mesh for further details.
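For the XML distribution, BT-NT pairs can be derived from the tree numbers: dropping the last dotted segment of a tree number gives its parent. The Python sketch below assumes the descriptor file uses DescriptorRecord, DescriptorName/String and TreeNumberList/TreeNumber elements; check those names against the DTD shipped with the release you download. The function name mesh_bt_nt_pairs and the SAMPLE_XML records are hypothetical and hand-made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-made sample in the assumed descriptor layout (illustrative only).
SAMPLE_XML = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorName><String>Body Regions</String></DescriptorName>
    <TreeNumberList><TreeNumber>A01</TreeNumber></TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Head</String></DescriptorName>
    <TreeNumberList><TreeNumber>A01.456</TreeNumber></TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Face</String></DescriptorName>
    <TreeNumberList><TreeNumber>A01.456.505</TreeNumber></TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def mesh_bt_nt_pairs(xml_text):
    """Return (narrower, broader) descriptor-name pairs from tree numbers."""
    root = ET.fromstring(xml_text)

    # Map every tree number to the name of the descriptor that owns it.
    name_by_tree_number = {}
    for record in root.iter("DescriptorRecord"):
        name = record.findtext("DescriptorName/String")
        for node in record.iterfind("TreeNumberList/TreeNumber"):
            if name and node.text:
                name_by_tree_number[node.text.strip()] = name

    pairs = []
    for tree_number in sorted(name_by_tree_number):
        if "." not in tree_number:
            continue  # top-level node: no parent descriptor
        parent = tree_number.rsplit(".", 1)[0]
        if parent in name_by_tree_number:
            pairs.append(
                (name_by_tree_number[tree_number], name_by_tree_number[parent])
            )
    return pairs
```

On the sample, mesh_bt_nt_pairs(SAMPLE_XML) pairs Head with Body Regions and Face with Head. The full descriptor file is large, so a streaming parse with xml.etree.ElementTree.iterparse would probably be a better fit than loading it in one go.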

You might also wish to consult, if you have not already done so, ANSI/NISO Z39.19, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. There used to be free-to-download copies of this available at the NISO web site, but it now appears to be a "for purchase" item. This standard used to be identical to ISO 2788:1986 and to the equivalent texts of the various other national standards-making bodies. However, Z39.19 looks to have been updated in 2005, so the texts may have diverged. ISO also has a multilingual standard, ISO 5964:1985.

Regards, Trevor

<>< Re: deemed!
