[Corpora-List] Named Entity Extraction from Noisy, Unstructured Texts

Diane M. Napolitano dmnapolitano at gmail.com
Thu Jun 18 23:32:09 CEST 2009

Hello, everyone! I'm looking for information on named entity recognition in documents that are almost completely unstructured and incredibly messy. I get a lot of documents that are basically text extracted from PDFs, images, PowerPoint slides and the like, and the resulting text is often missing a lot of formatting. I've read a number of papers and I've tried training a statistical package (Stanford) on data of this kind, but the resulting model actually performs worse than one trained on clean, narrative data. Right now, my group has a rule-based system that relies on gazetteer lists, which only gets us so far...

Anyone have any insights they could provide? :)

Thanks!
Diane
