[Corpora-List] Using version control software in corpus construction

chris brew cbrew at acm.org
Mon Mar 29 05:37:13 CEST 2010


On Sun, Mar 28, 2010 at 12:21 PM, Rich Cooper <rich at englishlogickernel.com> wrote:
> Version control is for text files that change during development.  If you
> put all your markup information into the text file with the actual text, it
> would be encoding your versions of markup as well as your corpora of
> phrases.

I agree that version control is highly suitable for text files that change during a development process. I also agree with the implicit suggestion that keeping markup and text in the same file is not always the best idea.

In addition, publicly accessible repositories, under version control, are a very good way of ensuring that a wider user community can access the whole development history of a corpus. This is valuable, because any corpus that is worth its salt will be used for studies at various stages during its development, and we should want to retain reproducibility even when a newer version comes along.

Public repositories are also desirable because it is generally good for possible imperfections in the corpus to be exposed to as many people as possible. Corpus developers, like software developers, should be keen for their bugs to be fixed by others. That wide exposure gets bugs found and fixed is a robust finding for software. Many software developers are now accustomed to the slightly queasy feeling of putting stuff out there despite its probable imperfections, and have found that the benefits of exposure justify the risk.

This open-source model is not so attractive if you are constrained by copyright or by institutional policy NOT to make the corpus fully available in an open-source form. In that case you might still want to use version control, but in a private repository, and perhaps to agitate for the copyright release or change in institutional policy that would allow you to fully benefit from the help of others.

I'm neutral on whether the format of the corpus should be defined with XML schemas, SQL, or something else, but insistent on the merits of defining it in some way that is amenable to automated checking, and available for extension and modification by others. It isn't necessarily crucial to get everything about the format right from the outset, but it is crucial to document the format as well as you are able, and to make clear statements about what the annotations are supposed to mean. The fact that LDC did this documentation task well with the Penn Treebank is the reason why others have been able to use, extend and transform it in all kinds of interesting ways, and also to find errors and inconsistencies in it. If we didn't know what the data was supposed to look like, we'd have no chance of telling when errors were happening.
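
To make "amenable to automated checking" a little more concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration, not a description of any real corpus: the one-token-per-line, tab-separated layout, the blank line as sentence boundary, the tiny tagset, and the name check_lines are all hypothetical. The point is only that once the format is written down, a few lines of code can tell you when the data departs from it.

```python
from typing import Iterable, List

# Hypothetical tagset, for illustration only; a real corpus would
# publish its own documented inventory of tags.
ALLOWED_TAGS = {"DT", "NN", "VBZ", "JJ", "."}


def check_lines(lines: Iterable[str]) -> List[str]:
    """Return human-readable problems found in token<TAB>tag lines."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            # Assumed convention: a blank line marks a sentence boundary.
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                f"line {lineno}: expected 2 tab-separated fields, got {len(fields)}"
            )
            continue
        token, tag = fields
        if not token:
            problems.append(f"line {lineno}: empty token")
        if tag not in ALLOWED_TAGS:
            problems.append(f"line {lineno}: unknown tag {tag!r}")
    return problems


if __name__ == "__main__":
    sample = ["The\tDT", "corpus\tNN", "grows\tVBZ", ".\t.", "", "oops\tXYZ", "broken line"]
    for problem in check_lines(sample):
        print(problem)
```

A check of this kind could be run before every commit to the repository, so that each recorded version of the corpus is known to conform to the format as documented at that point. The same idea carries over to an XML schema validator or to SQL constraints; the choice of mechanism matters less than having one.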


