I made a few remarks during the ACL 2011 business meeting in favor of this year's innovation of allowing submissions of data and code along with paper submissions. I suggested this is something we want to continue and encourage, particularly for papers submitted to the empirical track at ACL (which is the majority of papers these days), so that we might be able to reproduce results more easily. I had some slides prepared that I didn't use, but I've put them here; they summarize part of what I said at least (I forgot a few points, but the gist is fairly consistent, I guess...):
There were quite a few comments thereafter, and I took a few notes. I thought it might be useful to preserve these "for the record," since that discussion raised many of the common concerns about this issue. It might also be an opportunity for folks to follow up or at least continue thinking.
Below are the comments, approximately in the order they were made. Note that I'm simply trying to reproduce the gist of each comment, not to offer any opinion on them. I think it was great that there was such an extensive discussion, and I wanted to note that and preserve it as best I could. If anyone feels they have been misquoted, forgotten, or misunderstood, please feel free to jump in and elaborate.
0) The speaker was in support of encouraging more submissions of code and data, and noted that he was happy to see quite a few presentations at ACL where code and data were being made available.
1) Data is sometimes expensive to create (especially speech data) and releasing it after one publication may not be in the best interests of the creators.
2) Reviewing code is time consuming (and another concern raised during the business meeting was reviewer overload, so this certainly fit into that theme).
3) It is often hard or impossible for people in industrial settings to release code - the licensing issues are sometimes very complex and would need to be resolved before any code was submitted.
4) A prize could be offered for the best code / best data submitted.
5) It is hard to know how to review software.
6) Maybe software could be made available on an ACL cloud, in order to address some licensing concerns (especially those of industry).
7) Code at submission time is very hard to anonymize - maybe we need separate reviewers for code and data (separate from the paper reviewers).
8) Simply releasing or submitting code isn't necessarily useful (if it is bad code). How do we make sure the code is of high quality and/or useful?
9) There is a tension between having new and exciting ideas and producing well engineered code. Put another way, there's a tension between pushing the envelope and playing it safe. The speaker was concerned we might be moving too far away from encouraging new ideas.
10) Releasing code will in the end help the impact of the work. If you look at high-impact work in our field, it often centers around a resource (e.g., the Penn Treebank). Releasing code can also help people in industry, because sometimes publishing code is the only way it will ever get out (e.g., the sentence alignment code from Gale and Church's 1993 CL paper).
11) Offer a retroactive prize, after a few years, for released software systems that have proven to have some impact.
12) During the discussion of the new journal, it was mentioned that it could perhaps serve as a vehicle for releasing code and data.
I'm grateful that the ACL opened up the business meeting to these kinds of remarks, and I really appreciate both the opportunity to say a few words and the chance to hear all these different views. It's given me a lot to think about, and I just wanted to pass along my notes in the hopes of encouraging others to do the same. Keep talking. :)
-- Ted Pedersen http://www.d.umn.edu/~tpederse