[Corpora-List] Source code corpora

Klaus Guenther klaus.guenther at split.uni-bamberg.de
Thu Nov 20 20:21:10 CET 2008

The difficulty with large and important projects such as the Linux kernel is that there are few people who are allowed to commit code. If someone has code to submit, they provide it for analysis and once it is accepted, someone with sufficient karma submits it. So it is not possible to simply parse the commit emails by sender to determine the author.
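To illustrate the mismatch, here is a minimal sketch, assuming one has a log in which every record names both the person who wrote a patch and the person who committed it. The "author|committer" record format, the function name and the sample names are all hypothetical and only serve the illustration; they are not taken from any real repository.

```python
from collections import Counter

def committer_mismatch(records):
    """Count commits whose committer differs from the recorded author.

    Each record is a hypothetical 'author|committer' string.
    """
    counts = Counter()
    for record in records:
        # Split only on the first separator so the committer part stays whole.
        author, committer = (part.strip() for part in record.split("|", 1))
        counts["same" if author == committer else "different"] += 1
    return counts

# Hypothetical sample data, invented for this sketch.
records = [
    "Alice Example|Maintainer One",
    "Bob Example|Maintainer One",
    "Maintainer One|Maintainer One",
]
result = committer_mismatch(records)
print(result)
```

If the "different" count dominates, attributing code by committer alone would mostly credit the maintainers rather than the people who wrote the patches.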

In addition, many of the changes are very small, and coding standards are rigorously enforced, so formatting will hardly differ between developers. Instead, it might be helpful to isolate the comments and consider their frequency and style, as these generally do not follow any standard other than, perhaps, the requirement that they be frequent enough to explain to an unfamiliar programmer what each piece of code does and why it is coded the way it is.
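As a rough illustration of isolating comments, the following is a minimal sketch, assuming C-style comment syntax (/* ... */ and //), which both C and PHP accept. The function name and the chosen features are my own assumptions and not an established attribution method. The regular expression is deliberately simple and does not exclude comment markers that occur inside string literals.

```python
import re

# Matches /* ... */ block comments and // line comments (C and PHP style).
# Simplification: comment markers inside string literals are not excluded.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

def comment_profile(source: str) -> dict:
    """Return simple comment-frequency features for one source file."""
    comments = COMMENT_RE.findall(source)
    total_lines = len(source.splitlines())
    comment_chars = sum(len(c) for c in comments)
    return {
        "comments": len(comments),
        "block_comments": sum(1 for c in comments if c.startswith("/*")),
        "line_comments": sum(1 for c in comments if c.startswith("//")),
        "comments_per_line": len(comments) / total_lines if total_lines else 0.0,
        "comment_char_ratio": comment_chars / len(source) if source else 0.0,
    }

# Small invented sample, used only to exercise the function.
sample = """/* Add two integers. */
int add(int a, int b)
{
    return a + b; // no overflow check
}
"""
profile = comment_profile(sample)
print(profile)
```

Features such as the preference for block versus line comments, or the share of characters spent on comments, could then be compared across files whose authors are known.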

Smaller projects may be more interesting, especially where standards are less tightly enforced. Indeed, SourceForge and other open repositories provide masses of code that are often written by individuals or small teams, where each programmer commits code directly. I have experience working with one particular open source endeavor, the PEAR project (http://pear.php.net/). The code there is often very diverse, even with a coding standard. Older code is not necessarily updated to reflect changes to the coding standard, and code reuse is very popular. Each individual module (package) is controlled by one or more programmers who are fully responsible for its development. Yet many patches are submitted by developers who merely use the packages, and these are committed by the packages' developers, often after being edited.

So the main issue is finding code that can reliably be attributed to an author in an unmodified form and discovering details that are not attributable to the project's coding standard. I know of no such corpus.

Regards, Klaus

--- Klaus Guenther, M.A. University of Bamberg, Germany

Alexandre Rafalovitch schrieb:
> Wouldn't any source code repository with version control system give
> you that automatically? They all tell you exactly which code was
> contributed and by whom.
> E.g. SourceForge, Apache or Linux Kernel collections.
> http://www.koders.com/ might be a good way to search, if you are
> trying to narrow down to a particular area.
> Regards,
> Alex.
> Personal blog: http://blog.outerthoughts.com/
> Research group: http://www.clt.mq.edu.au/Research/
> On Thu, Nov 20, 2008 at 1:28 AM, <sdb at cs.rmit.edu.au> wrote:
>> Dear colleagues,
>> My research relates to authorship attribution of source code (that is,
>> determining the owner of anonymous work samples based upon other work
>> samples where authors are known).
>> I'm looking for recommendations for source code corpora for this task
>> for any programming language. For the corpora to be useful, authorship
>> has to be identified.