This is funny: we have also just started discussing the use of version control systems for a newly started project on data sharing and model building for machine translation (http://www.letsmt.eu/). Managing revisions seems very useful for such a collaborative initiative. However, we haven't started implementing our repository yet, and we would also like to hear about any experience with large-scale data files in SVN or related systems. Here are some questions we would like to answer:
* Is it possible to compress internal files in SVN or other systems? (What I mean is that SVN would take care of compressing the internal files in the repository, while check-in/check-out would still work with plain text files.)
* Is it possible to remove specific revisions, or even to restrict the history to a specific number of revisions? (I'm not sure whether this would be a good idea anyway.)
* How efficient is check-in/check-out for large repositories/files?
Any insights/hints (also about other issues) would be much appreciated.
PS: We will be looking for (data) contributions soon ....
On 3/28/10 7:14 PM, Piotr Bański wrote:
> One thing that version control gives you that has not been mentioned so
> far is that it makes it easy to define the state of the corpus as it was
> at the moment you performed calculations that you want to be
> reproducible. Before you perform any measurements, tag the current
> corpus as a 'development snapshot', and it will always be possible to go
> back to it later. This concerns both dynamic/monitor corpora as well as
> static corpora before any corrections are made to their data and/or
> I credit the observation concerning the usefulness (or actually virtual
> necessity, if empiricism is treated seriously) of 'snapshots' to Henry
> S. Thompson in a conference discussion earlier this year (though it
> may/must have been around for some time, I hope...). I'm not sure that
> he meant this in the sense of 'SVN/CVS/whatnot release tags', but
> translating it into version-control-speak is a trivial extension of that
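A note on the snapshot idea above: a tag in SVN or a similar system is the natural way to do this, but the same guarantee can be approximated without any particular tool by recording a fingerprint of the corpus files next to each set of results. The following is only a minimal sketch of that alternative, not anything proposed in this thread. The function name corpus_fingerprint and the list of skipped directories are hypothetical choices made for illustration.

```python
import hashlib
from pathlib import Path

# Assumption: metadata directories of common version control systems
# are not part of the corpus and should not affect the fingerprint.
SKIPPED_DIRS = {".svn", ".git", "CVS"}


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corpus_fingerprint(corpus_dir):
    """Return one SHA-256 hex digest summarising all files under corpus_dir.

    The digest changes if any file is added, removed, renamed or edited.
    """
    root = Path(corpus_dir)
    entries = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if any(part in SKIPPED_DIRS for part in rel.parts):
            continue
        entries.append((rel.as_posix(), file_digest(path)))

    # Sort by relative path so the result does not depend on listing order.
    overall = hashlib.sha256()
    for rel_name, digest in sorted(entries):
        overall.update(rel_name.encode("utf-8"))
        overall.update(b"\0")
        overall.update(digest.encode("ascii"))
        overall.update(b"\n")
    return overall.hexdigest()
```

Storing the returned string together with the measurements makes it possible to check later whether a working copy still matches the corpus state the numbers were computed on. It does not let you restore that state, which is what a real tag in the repository gives you.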
> On 2010-03-28 17:20, Hardie, Andrew wrote:
>> Hi all,
>> I am contemplating using a source-code version control system (such as
>> Subversion) to store the files of a corpus as it is being constructed,
>> (a) to help keep track of changes as I go, (b) to allow several people
>> to work on it in a non-confusing way and (c) to simplify backing up and
>> aid data security.
>> Using version control software occurred to me after spending some time
>> manually keeping track of a set of encoding and markup changes in an
>> older corpus, and finding it a total pain in the neck. Of course, this
>> is not exactly what version control software is designed for...
>> I was wondering, has anyone on the list done this before? If so, are
>> there any pitfalls to avoid / particular pointers I should be aware of?
>> Or alternative (better) ways of accomplishing the same thing?
>> All hints and tips gratefully received.
>> Andrew Hardie
>> Department of Linguistics
>> County South
>> Lancaster University
>> Lancaster LA1 4YL
>> United Kingdom
>> a.hardie at lancaster.ac.uk
>> Corpora mailing list
>> Corpora at uib.no