[Corpora-List] Schema and tools for annotating XML documents

Amir Zeldes Amir.Zeldes at georgetown.edu
Mon May 22 21:18:33 CEST 2017


Hi Nikola,

You might be interested in PAULA XML, which is a standoff annotation format relying on xpointers to connect annotations to an underlying text:

https://www.sfb632.uni-potsdam.de/images/doc/PAULA_P1.1.2013.1.21a.pdf

The text can contain HTML, and not all of the text needs to be tokenized: you can use XPointers to delimit only the content words as tokens and ignore the HTML around them, or tokenize the markup as well if you prefer.

Another solution is to use TEI XML stand-off markup. You might be interested in these two papers about that:

Piotr Bański and Adam Przepiórkowski. 2009. Stand-off TEI Annotation: The Case of the National Corpus of Polish. In Proceedings of the Third Linguistic Annotation Workshop (LAW), at ACL-IJCNLP 2009. Suntec, Singapore, 64–67.

http://delivery.acm.org/10.1145/1700000/1698392/p64-banski.pdf

Piotr Bański. 2010. Why TEI stand-off annotation doesn't quite work and why you might want to use it nevertheless. Balisage: The Markup Conference 2010.

https://www.balisage.net/Proceedings/vol5/html/Banski01/BalisageVol5-Banski01.html
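
To make the stand-off idea above concrete, here is a minimal Python sketch of tokens that live outside the text and point into it by character offset, skipping the HTML around the content words. It is only an illustration under assumptions: the names Mark and resolve_mark and the 1-based start/length convention are hypothetical simplifications, not actual PAULA or XPointer syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """A stand-off token (hypothetical layout): 1-based start offset and length."""
    id: str
    start: int
    length: int


def resolve_mark(text: str, mark: Mark) -> str:
    """Return the substring of the raw text that a mark points to."""
    begin = mark.start - 1  # convert the 1-based offset to a Python index
    return text[begin:begin + mark.length]


# The underlying text is never modified; the HTML stays in place.
raw = "<p>Hello <b>world</b>!</p>"

# Only the content words are delimited as tokens; tags and "!" are skipped.
marks = [Mark("tok_1", 4, 5), Mark("tok_2", 13, 5)]

# A second stand-off layer refers to token ids, not to the text itself.
pos_layer = {"tok_1": "UH", "tok_2": "NN"}

tokens = [resolve_mark(raw, m) for m in marks]
tagged = [(resolve_mark(raw, m), pos_layer[m.id]) for m in marks]
print(tagged)
```

Because every layer only stores pointers, further annotation layers can be added or removed without touching the text or each other.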

Hope this helps,

Amir

------------

Dr. Amir Zeldes

Asst. Prof. of Computational Linguistics

Department of Linguistics

Georgetown University

1437 37th St. NW

Washington, DC 20057

http://corpling.uis.georgetown.edu/amir

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Nikola Milosevic
Sent: Monday, May 22, 2017 7:02 AM
To: corpora at uib.no
Subject: [Corpora-List] Schema and tools for annotating XML documents

Hello,

I was wondering whether anyone knows of a schema that allows XML documents to be annotated with stand-off annotations, and perhaps a tool that supports it. In particular, I need something like this for annotating tables, and it should somehow preserve their structure. I have been working on a proposal that uses XPath to record the structure and location (it can be seen here: https://gist.github.com/nikolamilosevic86/c94382d4b52705e9ae75dab0eda6381e). Does anyone know of anything similar?
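
As a rough illustration of the XPath idea described above (not the schema from the linked gist), the following Python sketch stores each annotation as an XPath to a table cell plus character offsets inside that cell. The names CellAnnotation and resolve_cell, the sample table, and the labels are all hypothetical; it relies on the limited XPath subset in Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class CellAnnotation:
    """Hypothetical stand-off record: where the cell is, which span, what label."""
    xpath: str   # path from the document root to the cell
    start: int   # 0-based offset into the cell's text (inclusive)
    end: int     # 0-based offset into the cell's text (exclusive)
    label: str


# Sample document, invented for this example.
doc = ET.fromstring(
    "<article><table>"
    "<tr><th>Drug</th><th>Dose</th></tr>"
    "<tr><td>Aspirin</td><td>100 mg daily</td></tr>"
    "</table></article>"
)


def resolve_cell(root: ET.Element, ann: CellAnnotation) -> str:
    """Look up the cell by XPath and return the annotated span of its text."""
    cell = root.find(ann.xpath)
    if cell is None:
        raise LookupError(f"no element matches {ann.xpath!r}")
    text = "".join(cell.itertext())  # cell text including nested inline markup
    return text[ann.start:ann.end]


# The annotations live apart from the XML; the row/column position of each
# cell is carried by the path itself (second row, first or second cell).
annotations = [
    CellAnnotation("./table/tr[2]/td[1]", 0, 7, "DRUG"),
    CellAnnotation("./table/tr[2]/td[2]", 0, 6, "DOSE"),
]

spans = [(a.label, resolve_cell(doc, a)) for a in annotations]
print(spans)
```

One trade-off of this design is that positional paths such as tr[2] break if rows are inserted or removed, so the annotated document needs to stay fixed or the cells need stable ids.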

Best regards,

Nikola Milošević
