[Corpora-List] Using version control software in corpus construction

Steven Bird stevenbird1 at gmail.com
Mon Mar 29 03:48:10 CEST 2010


On 29 March 2010 03:30, Rob Malouf <rmalouf at mail.sdsu.edu> wrote:
> We used version control while building the Alpino corpus/treebank.  It works very well as long as your data and annotations are stored in a text-ish format (like XML).  Version control doesn't work especially well with binary files -- it'll keep track of the latest versions, but it can't track or merge individual changes.

Note that NLTK stores its corpora in svn, in binary format:

http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

It's not too wasteful, given that we don't actually curate the content of any corpora. The externally hosted revision control provides a stable way for people to reference previous distributions. Disk space is not an issue, since we only include samples in the case of large corpora (like TIMIT or Europarl).
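
To make the text-versus-binary trade-off from the quoted message concrete, here is a minimal sketch in Python. It is only an illustration and not how svn or NLTK work internally: the XML snippet, the function names, and the use of zlib as a stand-in for a "binary format" are all assumptions of mine. It shows that a one-attribute edit to a text-ish annotation file yields a small line-level diff, while the compressed form of the same data can only be compared as an opaque blob.

```python
import difflib
import zlib


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the removed ('- ') and added ('+ ') lines between two texts."""
    delta = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in delta if line.startswith(("- ", "+ "))]


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count byte positions that differ, plus any difference in length."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


# Hypothetical annotation fragment; not taken from any real treebank.
old_xml = (
    '<sentence id="1">\n'
    '  <word pos="det">de</word>\n'
    '  <word pos="noun">kat</word>\n'
    '</sentence>\n'
)
# A single annotation change: one attribute value on one line.
new_xml = old_xml.replace('pos="noun"', 'pos="n"')

# zlib compression stands in for an arbitrary binary storage format.
old_blob = zlib.compress(old_xml.encode("utf-8"))
new_blob = zlib.compress(new_xml.encode("utf-8"))

if __name__ == "__main__":
    # Text form: the change is localized to one line and easy to review or merge.
    for line in changed_lines(old_xml, new_xml):
        print(line)
    # Binary form: we can say the blobs differ, but not which annotation changed.
    print("blobs identical:", old_blob == new_blob)
    print("differing bytes:", differing_bytes(old_blob, new_blob))
```

In the text case the diff names the one line that changed, which is what lets a version control system track and merge individual edits. In the binary case all it can usefully record is that there is a new version of the file, which is fine for a use like ours where the content is not being curated.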

-Steven Bird


