[Corpora-List] Using version control software in corpus construction
stevenbird1 at gmail.com
Mon Mar 29 03:48:10 CEST 2010
On 29 March 2010 03:30, Rob Malouf <rmalouf at mail.sdsu.edu> wrote:
> We used version control while building the Alpino corpus/treebank. It works very well as long as your data and annotations are stored in a text-ish format (like XML). Version control doesn't work especially well with binary files -- it'll keep track of the latest versions, but it can't track or merge individual changes.
Note that NLTK stores its corpora in svn, in binary format:
It's not too wasteful, given that we don't actually curate the content
of any corpora. The externally hosted revision control provides a
stable way for people to reference previous distributions. Disk space
is not an issue, since we only include samples in the case of large
corpora (like TIMIT or Europarl).
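
To illustrate the quoted point about text-ish formats, here is a minimal
sketch in Python of a line-based diff over two revisions of a small XML
annotation. The XML fragments, the file names rev1.xml and rev2.xml, and the
helper changed_lines are all made up for illustration; this is not how svn
or NLTK does it, only a demonstration that an individual annotation change
shows up as a single changed line when the data is stored as text.

```python
import difflib

# Two hypothetical revisions of one annotated sentence (invented data).
old = """<sentence id="1">
  <word pos="noun">corpus</word>
  <word pos="noun">builds</word>
</sentence>
""".splitlines(keepends=True)

new = """<sentence id="1">
  <word pos="noun">corpus</word>
  <word pos="verb">builds</word>
</sentence>
""".splitlines(keepends=True)


def changed_lines(a, b):
    """Return only the removed/added lines from a unified diff of a and b."""
    diff = difflib.unified_diff(a, b, fromfile="rev1.xml", tofile="rev2.xml")
    return [
        line
        for line in diff
        # Keep "-" and "+" lines, but skip the "---" / "+++" file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


changes = changed_lines(old, new)
print("".join(changes), end="")
```

The same edit inside a compressed or otherwise binary file would not line up
this way, so a line-based tool could only report that the file as a whole
had changed.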