In our projects, one guiding principle for organizing the corpus architecture is to separate the parts that change most often from the parts that rarely change (for example, several layers of tags, produced by different taggers and tag sets, from the surface of the texts in NLP projects). To do this, we use various XML standoff annotation techniques. For some parts of our workflows we also use the one-word-per-line technique (also known as the IMS CWB source format).
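To illustrate the separation, here is a minimal Python sketch, not our actual tooling. It assumes a whitespace tokenizer and two made-up tag layers (`tagger_a`, `tagger_b`); the surface text is never modified, the annotations only point into it by character offsets, and a one-word-per-line view with tab-separated columns (in the spirit of the IMS CWB source format) is generated on demand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    start: int  # character offset into the surface text (inclusive)
    end: int    # character offset into the surface text (exclusive)


# Stable layer: the surface text, never modified by annotation tools.
SURFACE = "The cats sleep"


def tokenize(text):
    """Hypothetical whitespace tokenizer that returns offsets only."""
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(start, end))
        pos = end
    return tokens


TOKENS = tokenize(SURFACE)

# Volatile layers: one list of tags per tagger, aligned with TOKENS by index.
# Layer names and tag values are invented for this illustration.
LAYERS = {
    "tagger_a": ["DT", "NNS", "VBP"],
    "tagger_b": ["DET", "NOUN", "VERB"],
}


def to_vertical(text, tokens, layers, layer_names):
    """Render one token per line: word form, then one column per layer."""
    lines = []
    for i, tok in enumerate(tokens):
        columns = [text[tok.start:tok.end]]
        columns += [layers[name][i] for name in layer_names]
        lines.append("\t".join(columns))
    return "\n".join(lines)


VERTICAL = to_vertical(SURFACE, TOKENS, LAYERS, ["tagger_a", "tagger_b"])
print(VERTICAL)
```

Replacing or adding a tagger only touches one entry of `LAYERS`; the surface text and the token offsets stay untouched.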
> it is crucial to document the format as well as you are able,
> and make clear statements about what the annotations are supposed to
> mean.
We follow the guidelines of, and participate in, the Text Encoding Initiative (TEI) community: http://www.tei-c.org, which has been documenting corpus sources for exactly that purpose since 1994. If you feel that NLP data is not well represented in that standard, you are welcome to propose new encodings and discuss their adoption in the annual update of the guidelines. For example, we are in the process of proposing new encodings to document the full history of the command-line tools called during the preparation of a corpus (tokenizers and their parameters, taggers, etc.). We would like our tools to be able to read that history for their own processing needs. Documenting is a must, but so is sharing that documentation between people and software.
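As a rough idea of what a machine-readable processing history could look like, here is a small Python sketch. It is not the TEI encoding we are proposing; the JSON layout, the `ProcessingStep` fields, and the tool names (`my-tokenizer`, `my-tagger`) are all invented for illustration.

```python
import json
import shlex
from dataclasses import dataclass, asdict


@dataclass
class ProcessingStep:
    tool: str        # name of the command-line tool that was called
    version: str     # tool version, supplied by the caller
    arguments: list  # parameters passed to the tool, in order


def record_step(history, command_line, version):
    """Parse a command line and append it to the history."""
    parts = shlex.split(command_line)
    step = ProcessingStep(tool=parts[0], version=version, arguments=parts[1:])
    history.append(step)
    return step


def dump_history(history):
    """Serialize the history so that people and other tools can read it."""
    return json.dumps([asdict(step) for step in history], indent=2)


def load_history(text):
    """Read a serialized history back into ProcessingStep objects."""
    return [ProcessingStep(**item) for item in json.loads(text)]


# Hypothetical tool calls, invented for this example.
HISTORY = []
record_step(HISTORY, "my-tokenizer --lang fr input.txt", "1.0")
record_step(HISTORY, "my-tagger --model 'french model' tokens.txt", "2.3")

RESTORED = load_history(dump_history(HISTORY))
print(dump_history(RESTORED))
```

A later tool could call `load_history` to find out, for instance, which tokenizer and parameters produced the tokens it is about to process.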
--Serge Heiden
-- Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lsh.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex
tél. +33(0)622003883