[Corpora-List] Apply Coreference Resolution in Wikipedia

Joel Nothman joel.nothman at gmail.com
Sat Apr 21 13:49:58 CEST 2012


Hi Daniel,

General coreference resolution may group all noun phrases into coreferential clusters. The problem is simpler if you're only interested in particular entity types (e.g. people), or particular entities, or if you have additional information about those entities. In Wikipedia, where you have near-gold-standard links to articles about some of the entities mentioned in an article, you can use information contained in the linked article (or contained in other pages with the same link target).

Depending on the needs of your task, you may also obtain enough reliable samples by using high-precision methods: ignore pronoun anaphora, and ignore cases where a name may be ambiguous (e.g. "Washington" in an article where both person and city are link targets).

I called this task - matching names in a Wikipedia page to entities linked from that page - "link inference" in my work on transforming Wikipedia to NER training data (see http://schwa.org/projects/resources/wiki/Wikiner for references). For this application, we could rely on redundancy to discard low-confidence matches.

I applied a simple heuristic solution that was evaluated extrinsically as a variable in the Wiki->NER task: when processing article A, collect all aliases of the articles that A links to, drawing on several sources that are ranked according to their reliability. Then, essentially, find the longest matching strings in A's text, preferring more reliable alias information, ignoring some lowercase variants, and discarding conflicts.
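A minimal Python sketch of a heuristic of this kind, under stated assumptions, might look like the following. It is not the original script: the function names, the input format (alias source -> alias string -> set of link targets), whitespace tokenisation, and the rule of skipping all-lowercase aliases are all illustrative assumptions. Here the most reliable source that knows an alias decides its target, and the longest match decides the span.

```python
def build_alias_table(aliases_by_source, source_rank):
    """Map each alias (as a token tuple) to one target, or None on conflict.

    source_rank lists source names from most to least reliable (assumed input).
    """
    candidates = {}
    best_rank = {}
    for rank, source in enumerate(source_rank):
        for alias, targets in aliases_by_source.get(source, {}).items():
            if alias.islower():
                continue  # assumption: drop all-lowercase aliases
            key = tuple(alias.split())
            if not key:
                continue
            if key in best_rank and best_rank[key] < rank:
                continue  # a more reliable source already covers this alias
            best_rank[key] = rank
            candidates.setdefault(key, set()).update(targets)
    # An alias with several targets at its best rank is a conflict.
    return {key: (next(iter(t)) if len(t) == 1 else None)
            for key, t in candidates.items()}


def infer_links(tokens, table):
    """Greedy left-to-right longest match; returns (start, end, target) spans."""
    max_len = max((len(key) for key in table), default=0)
    links = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in table:
                if table[key] is not None:
                    links.append((i, i + n, table[key]))
                i += n  # conflicting spans are consumed but not linked
                break
        else:
            i += 1
    return links


source_rank = ["title", "redirect", "disambiguation"]
aliases_by_source = {
    "title": {"Barack Obama": {"Barack_Obama"},
              "George Washington": {"George_Washington"}},
    "redirect": {"Obama": {"Barack_Obama"}},
    "disambiguation": {"Washington": {"George_Washington", "Washington,_D.C."},
                       "Obama": {"Obama,_Fukui"}},
}
table = build_alias_table(aliases_by_source, source_rank)
tokens = "Barack Obama visited Washington . Obama spoke .".split()
links = infer_links(tokens, table)
```

In this toy input, "Washington" is left unlinked because its only alias source gives two targets, while "Obama" resolves through the redirect source, which outranks the disambiguation source.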

For aliases I experimented with:
* Article titles
* Article redirect titles
* Titles and redirect titles of relevant disambiguation pages (important for
* Final words in titles of person articles
* Text of incoming links

It may also be worth including:
* Bold text in the first paragraph of the article
* Foreign language equivalent titles
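To make a few of these alias sources concrete, here is a small Python sketch that gathers titles, redirect titles, and final words of person titles into one structure keyed by source. The record fields ('title', 'redirects', 'is_person'), the source names, and the function name are hypothetical; how the underlying data is extracted from a Wikipedia dump is not shown.

```python
from collections import defaultdict


def collect_aliases(articles):
    """Build source -> alias string -> set of link targets.

    `articles` maps a link target to a record with the hypothetical keys
    'title', 'redirects' (optional) and 'is_person' (optional).
    """
    by_source = defaultdict(lambda: defaultdict(set))
    for target, record in articles.items():
        title = record["title"]
        by_source["title"][title].add(target)
        for redirect in record.get("redirects", ()):
            by_source["redirect"][redirect].add(target)
        if record.get("is_person"):
            words = title.split()
            if len(words) > 1:
                # Final word of a person's title, e.g. a surname.
                by_source["person_final_word"][words[-1]].add(target)
    return by_source


articles = {
    "Barack_Obama": {"title": "Barack Obama",
                     "redirects": ["Obama", "Barack H. Obama"],
                     "is_person": True},
    "Michelle_Obama": {"title": "Michelle Obama", "is_person": True},
    "Sydney": {"title": "Sydney"},
}
aliases = collect_aliases(articles)
```

Keeping the source name alongside each alias lets a later step rank sources by reliability and spot aliases, such as a shared surname, that point to more than one target.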

(Using the text of all incoming links without considering frequency is probably a bad idea and consistently reduced my task performance.)
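One way to take frequency into account, sketched in Python below, is to keep an incoming-link text as an alias only if it points to the target often enough, both in absolute terms and as a share of all uses of that text. This is an illustrative assumption rather than the filter used in the experiments above; the function name and both thresholds are hypothetical.

```python
from collections import Counter


def frequent_anchor_aliases(links, min_count=2, min_share=0.5):
    """Filter (anchor_text, target) pairs by frequency.

    Keeps a pair if it occurs at least `min_count` times and accounts for at
    least `min_share` of all links that use that anchor text.
    Returns anchor text -> set of targets.
    """
    pair_counts = Counter(links)
    text_counts = Counter(text for text, _ in links)
    kept = {}
    for (text, target), n in pair_counts.items():
        if n >= min_count and n / text_counts[text] >= min_share:
            kept.setdefault(text, set()).add(target)
    return kept


links = (
    [("Obama", "Barack_Obama")] * 8
    + [("Obama", "Obama,_Fukui")] * 2
    + [("he", "Barack_Obama")]
    + [("the president", "Barack_Obama")]
)
anchor_aliases = frequent_anchor_aliases(links)
```

With these toy counts, only "Obama" -> Barack_Obama survives: the one-off anchors fall below the count threshold, and the rarer target for "Obama" falls below the share threshold.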

Given that the entire method uses data of questionable reliability and is naively heuristic, this does a reasonable job, but results are far from perfect. In particular, extracting reliable disambiguation data is very difficult.

While it is not feasible for me to send you a corpus of Wikipedia with these links identified, I may be able to send you the extracted alias data, and perhaps the Python script for inferring links. Email me privately if that is of interest.

You may also be interested in the Named Entity Linking (or Disambiguation) literature. It is not concerned with coreference in Wikipedia text, but it commonly links mentions to Wikipedia entities, and so it is also concerned with collecting aliases for Wikipedia entities.

Good luck!

Joel Nothman
PhD candidate
School of IT
University of Sydney

On Fri, 20 Apr 2012 20:55:44 +1000, Gerber Daniel <dgerber at informatik.uni-leipzig.de> wrote:


> Hello,
> I'm currently working on a distant supervision approach for relation
> extraction. I'm using the English Wikipedia articles to find sentences
> which contain labels of resources, for example a resource's name like
> "Barack Obama". My problem is now that this string only occurs in the
> first couple of sentences of the article and is then substituted for
> example with pronouns or things like "The president ..." So what I want
> to do is to apply coreference resolution on the complete English
> Wikipedia (ideally also in other languages like German) and replace
> those substitutions with the resource name.
>
> Is there a corpus like this already available? If not, would I need to
> write this myself (using some lib) or are there applications available
> which are able to do this?
> Also, what would be a good library for this task (speed, accuracy)? I
> came across Illinois Coreference Package, StanfordNLP, OpenNLP, Illinois
> but I can't afford to try them all. :/
>
> I would be very happy for some suggestions!
>
> Kind regards,
> Daniel