You might be interested in PAULA XML, a stand-off annotation format that relies on XPointers to connect annotations to an underlying text:
The text can contain HTML, and not all of it needs to be tokenized: XPointers can delimit just the content words as tokens and either skip or include the HTML around them. Another solution is to use TEI XML stand-off markup. You might be interested in these two papers about that:
Piotr Bański and Adam Przepiórkowski. 2009. Stand-off TEI Annotation: The Case of the National Corpus of Polish. In Proceedings of the Third Linguistic Annotation Workshop (LAW), at ACL-IJCNLP 2009. Suntec, Singapore, 64–67.
Piotr Bański. 2010. Why TEI stand-off annotation doesn't quite work and why you might want to use it nevertheless. Balisage: The Markup Conference 2010.
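To make the stand-off idea concrete, here is a minimal Python sketch. It is not PAULA or TEI syntax: the character-offset representation, the function name tokenize_standoff and the pos_layer dictionary are all assumptions made for illustration. It only shows the principle that the base text (HTML included) stays untouched while tokens and annotations point into it from outside.

```python
import re

# Base text with inline HTML; it is never modified by annotation.
BASE = "<p>The <b>quick</b> fox</p>"

def tokenize_standoff(base):
    """Return (start, end) character spans for words, skipping HTML tags."""
    spans = []
    # Match either a tag (skipped) or a run of word characters (kept).
    for m in re.finditer(r"<[^>]+>|\w+", base):
        if not m.group().startswith("<"):
            spans.append((m.start(), m.end()))
    return spans

tokens = tokenize_standoff(BASE)

# A separate annotation layer (hypothetical), keyed by token index.
pos_layer = {0: "DET", 1: "ADJ", 2: "NOUN"}

for i, (start, end) in enumerate(tokens):
    print(BASE[start:end], pos_layer.get(i))
```

Because the tags are simply not pointed at, the markup around the content words can be ignored or annotated separately as needed.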
Hope this helps,
Dr. Amir Zeldes
Asst. Prof. of Computational Linguistics
Department of Linguistics
1437 37th St. NW
Washington, DC 20057
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Nikola Milosevic
Sent: Monday, May 22, 2017 7:02 AM
To: corpora at uib.no
Subject: [Corpora-List] Schema and tools for annotating XML documents
I was wondering whether anyone knows of a schema that allows XML documents to be annotated with stand-off annotations, and perhaps a tool that supports it. In particular, I need something like this for annotating tables, and it should somehow preserve their structure. I have been working on a proposal that uses XPath to record the structure and location (it can be seen here: https://gist.github.com/nikolamilosevic86/c94382d4b52705e9ae75dab0eda6381e). Does anyone know of anything similar?
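As a rough illustration of the general idea (not the schema from the gist above), a stand-off annotation for a table cell could pair an XPath to the cell with character offsets into its text. The sample document, the field names xpath, start, end and label, and the resolve helper are all hypothetical. The sketch uses Python's standard xml.etree.ElementTree, which supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Sample document (made up for illustration); it stays unmodified.
DOC = """<article>
  <table>
    <tr><th>Drug</th><th>Dose</th></tr>
    <tr><td>Aspirin</td><td>100 mg</td></tr>
  </table>
</article>"""

# Stand-off annotations kept apart from the document: an XPath locates
# the cell, and start/end are character offsets within the cell text.
annotations = [
    {"xpath": "./table/tr[2]/td[1]", "start": 0, "end": 7, "label": "Drug"},
    {"xpath": "./table/tr[2]/td[2]", "start": 0, "end": 6, "label": "Dose"},
]

def resolve(root, ann):
    """Return the annotated substring, or None if the path does not resolve."""
    node = root.find(ann["xpath"])
    if node is None or node.text is None:
        return None
    return node.text[ann["start"]:ann["end"]]

root = ET.fromstring(DOC)
resolved = [(resolve(root, a), a["label"]) for a in annotations]
print(resolved)
```

Since the row and column positions are part of the XPath, the table structure is recoverable from the annotation itself, although such paths break if the document is later edited.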