As an example, there was considerable variation in how people evaluated their systems, with a race to the bottom in terms of meaningful evaluation (people wanted to see improvements from noisy world knowledge, so they'd evaluate on gold mentions, because that's the only setting in which it helps a lot), and even people writing ACL papers about how everyone else was doing it wrong weren't above "choosing not to impute certain errors".
These are the same issues that plagued parsing evaluation ca. 1997. In coreference, the SemEval-2010 and CoNLL shared tasks mean that we now have a dataset that's fully accessible to everyone (unlike the ACE data, where the testing data was not distributed to participants), with a standardized scorer (written by Emili Sapena for SemEval, with many corrections and improvements from Sameer Pradhan and the others who organized the CoNLL shared task).
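For concreteness, here is a minimal sketch of the link-based MUC metric, one of the quantities such a coreference scorer computes. This is an illustration of the idea only, not the official CoNLL scorer (which also reports B-cubed, CEAF, and BLANC and handles many edge cases); the mention ids and the example chains are made up.

```python
# Minimal sketch of the link-based MUC metric (Vilain et al., 1995).
# Illustrative only -- NOT the official reference scorer.

def num_partitions(chain, other_chains):
    """Number of pieces `chain` is split into by `other_chains`.

    Mentions are grouped by which other chain contains them; a
    mention appearing in no other chain counts as a singleton piece.
    """
    groups = set()
    singletons = 0
    for mention in chain:
        for i, other in enumerate(other_chains):
            if mention in other:
                groups.add(i)
                break
        else:
            singletons += 1
    return len(groups) + singletons

def muc_recall(key, response):
    """MUC recall of `response` chains against `key` chains.

    Each chain is a set of mention ids. Swapping the arguments
    gives MUC precision.
    """
    num = sum(len(k) - num_partitions(k, response) for k in key)
    den = sum(len(k) - 1 for k in key)
    return num / den if den else 0.0

# One gold entity {a, b, c}; the system splits it in two and
# adds a spurious mention d.
key = [{"a", "b", "c"}]
response = [{"a", "b"}, {"c", "d"}]
print(muc_recall(key, response))   # recall: 0.5
print(muc_recall(response, key))   # precision: 0.5
```

Note that with gold mentions the spurious mention d could never appear, which is exactly why scores in the gold-mention setting are not comparable to scores with system mentions.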
So it’s definitely possible to measure the same thing for everyone, even if it takes some effort.
In NLP, you want not only to measure the same thing for everyone, but also the right thing: normally, people don't want to use a parser to find out exciting new things about PTB section 23, but to use it on 18th-century German, or on Arabic blogs, or the next exciting thing. Which is why, once you have one point of reference firmly down, you want to get to another one to see whether your assumptions still hold.
So, yes, it’s perfectly possible to do cargo-cult-style NLP, which is why standardized evaluations and people actually replicating others’ experiments are both important. And I picked established tasks here because earlier mistakes are more visible and better understood, not because I couldn’t come up with more egregious examples from new and exciting tasks.
From: Noah A Smith Sent: Wednesday, 9 April 2014 03:59 To: Kevin B. Cohen Cc: corpora
What are the "unknown ways" that one NLP researcher's conditions might differ from another NLP researcher's? If you're empirically measuring runtime, you might have a point. But if you're using a standardized dataset and automatic evaluation, it seems reasonable to report others' results for comparison. Since NLP is much more about methodology than scientific hypothesis testing, it's not clear what the "experimental control" should be. Is it really better to run your own implementation of the competing method? (Some reviewers would likely complain that you might not have replicated the method properly!) What about running the other researcher's code yourself? I don't think that's fundamentally different from reporting others' results, unless you don't trust what they report. Must I reannotate a Penn Treebank-style corpus every time I want to build a new parser?
-- Noah Smith Associate Professor School of Computer Science Carnegie Mellon University
On Tue, Apr 8, 2014 at 6:57 PM, Kevin B. Cohen <kevin.cohen at gmail.com> wrote:
I was recently reading the Wikipedia page on "cargo cult science," a concept attributed to no less a light than Richard Feynman. I found this on the page:
"An example of cargo cult science is an experiment that uses another researcher's results in lieu of an experimental control. Since the other researcher's conditions might differ from those of the present experiment in unknown ways, differences in the outcome might have no relation to the independent variable under consideration. Other examples, given by Feynman, are from educational research, psychology (particularly parapsychology), and physics. He also mentions other kinds of dishonesty, for example, falsely promoting one's research to secure funding."
If we all had a dime for every NLP paper we've read that used "another researcher's results in lieu of an experimental control," we wouldn't have to work for a living.
What do you think? Are we all cargo cultists in this respect?
Kevin Bretonnel Cohen, PhD Biomedical Text Mining Group Lead, Computational Bioscience Program, U. Colorado School of Medicine 303-916-2417 http://compbio.ucdenver.edu/Hunter_lab/Cohen
_______________________________________________ Corpora mailing list Corpora at uib.no http://mailman.uib.no/listinfo/corpora