[Corpora-List] Alias Detection Dataset

Ayah Zirikly aya.zerikly at gmail.com
Wed Oct 8 04:42:54 CEST 2014


Dear Ralf,

Thank you very much for the information about JRC-Names. It is actually very helpful for the name transliteration task and can be used as an extra feature in my system. However, what I am looking for is the following. Assume we have:

John Smith went to school. Peter loves school.

We are trying to check whether John Smith and Peter refer to the same named entity or not. The ideal dataset would be annotated as follows:

<e1>John Smith</e1> went to school. <e1>Peter</e1> loves school. <e2>Anna</e2> always comes late to classes unlike <e1>Peter</e1>.
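
To make the annotation scheme above concrete, here is a minimal sketch of how such text could be read into alias clusters. It assumes the tags are exactly of the form <eN>...</eN> and are never nested; the function name group_aliases and the "eN" keys are only illustrative, not part of any existing dataset or tool.

```python
import re
from collections import defaultdict

# Matches e.g. <e1>John Smith</e1>; the backreference \1 requires the
# closing tag to carry the same entity number as the opening tag.
MENTION = re.compile(r"<e(\d+)>(.*?)</e\1>")

def group_aliases(annotated_text):
    """Map each entity id to its distinct mentions, in order of first appearance."""
    clusters = defaultdict(list)
    for match in MENTION.finditer(annotated_text):
        entity_id = "e" + match.group(1)
        mention = match.group(2)
        if mention not in clusters[entity_id]:
            clusters[entity_id].append(mention)
    return dict(clusters)

example = (
    "<e1>John Smith</e1> went to school. <e1>Peter</e1> loves school. "
    "<e2>Anna</e2> always comes late to classes unlike <e1>Peter</e1>."
)
clusters = group_aliases(example)
print(clusters)
```

Each cluster with more than one mention (here e1) is a set of aliases for the same entity, which is the gold standard an alias detection system would be evaluated against.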

Thanks again and I appreciate the help,

Ayah

On Tue, Oct 7, 2014 at 10:11 AM, Ralf Steinberger < ralf.steinberger at jrc.ec.europa.eu> wrote:


> Dear Ayah,
>
>
>
> I am not entirely sure what you mean when writing “without the use of
> external resources”, but you may find that JRC-Names
> <https://ec.europa.eu/jrc/en/language-technologies/jrc-names> can be
> helpful for your task. You can download it and integrate it with your
> application. You can find JRC-Names at:
>
> https://ec.europa.eu/jrc/en/language-technologies/jrc-names
>
>
>
> JRC-Names is a collection of several hundred thousand names and their
> variant spellings, including across scripts and languages. The name
> spellings were found by analysing almost 200,000 multilingual online news
> articles per day and by automatically merging spelling variants with
> previously known name spellings. For example, you will find
>
>
>
> Wladimir Putin,
>
> Vladimir Poutine,
>
> Vladímir Putin,
>
> Vlagyimir Putyin,
>
> فلاديمير بوتين
>
> and more as variant spellings of
>
> Владимир Путин.
>
>
>
> JRC-Names is updated daily with new names and name variant spellings
> found. JRC-Names is a by-product of the Europe Media Monitor
> <http://emm.newsbrief.eu/overview.html> family of applications.
>
>
>
> Of course you can take the full names apart in order to work with name
> parts only (e.g. *Vladimir*).
>
>
>
> I hope you find this resource useful.
>
>
>
> All the best,
>
>
>
> Ralf
>
>
>
>
>
> *From:* corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] *On Behalf
> Of *Ayah Zirikly
> *Sent:* 07 October 2014 15:48
> *To:* corpora at uib.no
> *Subject:* [Corpora-List] Alias Detection Dataset
>
>
>
> Hi,
>
>
>
> I am trying to find datasets for alias detection (preferably in
> microblogs). The task I am interested in is: given free text, find the
> aliases of a named entity without the use of external resources. It doesn't
> have to be person names; it can be organizations or any other type of named
> entity.
>
>
>
> Thanks a lot,
>
>
>
> Ayah
>