[Corpora-List] XML encoding database of tagged documents
lou.burnard at computing-services.oxford.ac.uk
Mon Jun 5 21:31:01 CEST 2006
I know of no application of TEI which uses more than a very small
proportion of the 600+ elements it defines in total (probably a bit more
than 1%, but certainly less than 10%!). The point about the TEI standard
is that it is designed to be modular and customisable, so that you can
use it to develop interchangeable resources. If I've understood your
intended application right, you're talking about a kind of standoff
annotation, which would allow you to create pseudo documents consisting
of pointers into a separate text file: this is what the <span> element
provides (probably not <milestone>s, since they are embedded within the
text itself). A document containing such pointers is still, I think, a
text document, and so can be described by a suitable subset of TEI.
However, we probably shouldn't burden readers of this list with a
theological debate! If you'd like to send me a sample of the kind of
thing you have in mind, I'd be glad to make more concrete suggestions.
Another XML-based standard you might consider in this context is Topic
Maps, which perform a similar kind of annotation function.
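
To make the standoff idea above a little more concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions, not
TEI markup and not any tool's actual format: the element and attribute names
(annotations, segment, code, start, end, comment), the sample text and the
codes are all hypothetical. The text is left untouched, each user-defined
code points into it by character offsets, and a comment hangs off the code
rather than off the document, so overlapping segments cause no difficulty.

```python
import xml.etree.ElementTree as ET

# The base text lives on its own and is never modified by annotation.
TEXT = "The contract was signed in May. Payment is due in June."

# Hypothetical user-defined codes: (code, start, end, comment).
# Offsets are character positions into TEXT, with end exclusive.
# The two segments overlap on "May." because nothing is embedded in the text.
ANNOTATIONS = [
    ("CONTRACT", 0, 31, "Signature clause"),
    ("DATES", 27, 54, None),
]


def build_standoff(annotations):
    """Build a standoff XML tree holding only pointers into the text."""
    root = ET.Element("annotations")
    for code, start, end, comment in annotations:
        seg = ET.SubElement(
            root, "segment", code=code, start=str(start), end=str(end)
        )
        if comment is not None:
            # The comment belongs to the coded segment, not to the document.
            ET.SubElement(seg, "comment").text = comment
    return root


def resolve(text, root):
    """Return (code, covered text) pairs by following each pointer."""
    return [
        (seg.get("code"), text[int(seg.get("start")):int(seg.get("end"))])
        for seg in root.iter("segment")
    ]


standoff = build_standoff(ANNOTATIONS)
resolved = resolve(TEXT, standoff)
xml_out = ET.tostring(standoff, encoding="unicode")
```

Character offsets are the simplest possible pointer, but they break if the
underlying text is edited, which is one reason to prefer named anchors in a
real application.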
Normand Peladeau wrote:
> Well! TEI is a great standard but is much more than what I need.
> Maybe 99% of what they propose would not be very useful for the kind
> of application I am trying to do.
> I don't need to keep information about the text structure or about
> linguistic or typographic features. The only elements that I need to
> keep inside the documents are user defined codes attached to text
> segments. Those codes can be overlapping (the "milestone" element
> proposed by TEI may offer a solution for this, but I'm not entirely
> sure it handles all the situations pretty well, so some tests will be
> needed). As for comments, they are not attached to the document itself
> but to the user defined codes, so I'm not sure they are equivalent to
> the TEI <note> element.
> I have some clients in the market research industry and in legal firms
> who are doing manual annotations of documents in databases and are not
> at all interested in the kind of information normally provided by a
> TEI compliant document. What I am looking for is a more basic set of
> XML standards that are used to import and export databases containing
> documents (but also numerical data, dates, etc.) and where the only
> relevant elements in the documents are the user defined codes attached
> to text segments (sometimes overlapping).