Here's a paper on information retrieval of patents based on named entity recognition of chemicals & converting from a textual to a structural representation:
Text analytics is becoming an increasingly important tool used in biomedical research. While advances continue to be made in the core algorithms for entity identification and relation extraction, a need for practical applications of these technologies arises. We developed a system that allows users to explore the US Patent corpus using molecular information. The core of our system contains three main technologies: A high performing chemical annotator which identifies chemical terms and converts them to structures, a similarity search engine based on the emerging IUPAC International Chemical Identifier (InChI) standard, and a set of on demand data mining tools. By leveraging this technology we were able to rapidly identify and index 3,623,248 unique chemical structures from 4,375,036 US Patents and Patent Applications. Using this system a user may go to a web page, draw a molecule, search for related Intellectual Property (IP) and analyze the results. Our results prove that this is a far more effective way for identifying IP than traditional keyword based approaches.

Kev

On Thu, Feb 26, 2009 at 4:02 AM, Eva D'hondt <e.dhondt at let.ru.nl> wrote:
> We have just started a project here at the Radboud University of Nijmegen
> that deals with Passage Retrieval and Text Mining in patent texts. I was
> wondering if anyone could point me to some literature/research/interesting
> facts on the linguistic and statistical characteristics of the language used
> in patent texts (e.g. frequency and hierarchical organisation of
> PP-attachments, use of gerund clauses vs. the relative clause with an
> inflected verb, average sentence length in the different sections, ... ).
> I will of course post a summary of your replies on this list.
> Thank you ever so much!
> Eva D'hondt, PhD student
> Centre for Language and Speech Technology
> University of Nijmegen
> Email: e.dhondt at let.ru.nl
> Corpora mailing list
> Corpora at uib.no
--
K. B. Cohen
Biomedical Text Mining Group Lead, Center for Computational Pharmacology
and
Lead Artificial Intelligence Engineer, The MITRE Corporation, Human Language Technology Division
303-916-2417 (cell)
303-377-9194 (home)
http://compbio.uchsc.edu/Hunter_lab/Cohen