[Corpora-List] Named Entity Corpora in Dutch

Mikhail Kozhevnikov amblerr at gmail.com
Thu Nov 8 00:28:32 CET 2012


Dear Martin,

To my knowledge, even the bits already annotated are not available yet, as the data has not been officially released. I tried to obtain the SRL annotations described in this paper <http://lt3.hogent.be/media/uploads/publications/2012/FinalSRL.pdf> at the end of September and got the following reply:

> The SRL annotations are not part of the second release of the intermediate
> SoNaR results. The final release will comprise SRL annotations: a 500K
> corpus that has been automatically labeled and a 500K corpus that has been
> completely manually verified.
> We do not know when the final release will be available, since the project
> is still not officially closed: an evaluation has shown that some
> alterations need to be made and documentation needs to be added. We can not
> start distribution before the official ending of the project.

I too would be very interested in any new information concerning the release date or (partial) availability of the data.

Regards, Mikhail

On Wed, Nov 7, 2012 at 9:28 PM, Martin Reynaert <reynaert at uvt.nl> wrote:


> Dear Ivelina,
>
> For Dutch we now have the SoNaR-500 corpus (currently about 540 million
> word tokens of contemporary written Dutch, automatically annotated) and the
> SoNaR-1 corpus (about 1 million word tokens of contemporary written Dutch,
> largely manually annotated for semantics).
>
> For Named Entity Recognition the Support-Vector Machine tool (called
> 'NERD' for 'Named Entity Recognition for Dutch', developed at LT3, Ghent
> University, by Bart Desmet) used to automatically label SoNaR-500 was
> trained on the NEs manually labeled in SoNaR-1.
>
> To acquire the corpus, please enquire at the Dutch HLT Agency:
>
> http://www.inl.nl/tst-centrale/
>
> The full corpus itself may not be fully available yet, but should be soon,
> and you can at least sort out the licensing part at this stage. In fact, I
> am to date curating parts of its metadata.
>
> Best,
>
> Martin
>
> On 11/07/2012 06:23 PM, Ivelina Nikolova wrote:
>
>> On 11/07/2012 05:49 PM, Alberto Lavelli wrote:
>>
>>> The CoNLL 2002 shared task concerned Named Entity Recognition for
>>> Spanish and Dutch.
>>> You can find information about the CoNLL series here:
>>>
>>> http://ifarm.nl/signll/conll/
>>>
>>> Hope this helps
>>>
>>
>> Thanks Alberto!
>> I got several references to this task corpus especially. It seems to be
>> the most used one.
>>
>> Best,
>> Ivelina
>>
>>
>>
>>> alberto
>>>
>>>
>>> On Wed, Nov 07, 2012 at 04:13:07PM +0200, Ivelina Nikolova wrote:
>>>
>>>> Dear Corpora Members,
>>>>
>>>> I am searching for corpora in Dutch with Named Entity annotations.
>>>> I'm interested in Person, Location, Organization and Event mentions.
>>>> Do you have any suggestions on that?
>>>>
>>>> Thank you very much!
>>>> Ivelina
>>>>
>>>> --
>>>> Ivelina Nikolova
>>>> PhD student in Computer Science
>>>> Linguistic Modelling Department
>>>> Institute of Information and Communication Technologies
>>>> Bulgarian Academy of Sciences
>>>>
>>>>
>>>> _______________________________________________
>>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>>> Corpora mailing list
>>>> Corpora at uib.no
>>>> http://mailman.uib.no/listinfo/corpora
>>>>
>>>
>>
>>
>