[Corpora-List] A question about corpus linguistics with relevance to natural language processing: how to sample a population of papers about NLP?

Min-Yen Kan knmnyn at gmail.com
Sun Apr 9 05:54:07 CEST 2017

Hi Kevin, all:

WRT your option #1, please let me know if you need help with any backend processing with the ACL Anthology, e.g., bulk downloads of a certain section of the Anthology. We do try to support projects that use the Anthology as a corpus, because, well, that is part of the mission of the Anthology. Feel free to contact me directly outside of corpora-list; I am replying publicly here only because it may be of help to others too.

The ACL Anthology rather loosely defines what is CL/NLP. We basically ingest whatever metadata a related conference/workshop/event gives us, as long as that event can indemnify the ACL for redistributing its publications. The work has to be deemed related to CL/NLP by the contributor as well as the editor (I am rather lax on this, as I think the readership can weed out irrelevant work).

You might also find the ACL Anthology Reference Corpus (ACL ARC v2; to hopefully be published with the LDC sometime later this year) useful. http://acl-arc.comp.nus.edu.sg/

@all: this extends to any and all other uses of the Anthology as a corpus.


Min
ACL Anthology Editor

On Thu, Apr 6, 2017 at 2:14 PM, Kevin B. Cohen <kevin.cohen at gmail.com> wrote:
> Hello, all,
> I have a question somewhere in the intersection between corpus linguistics
> and natural language processing. I'd like to do a descriptive study of
> natural language processing papers, but I have no idea how to draw a
> representative sample of them.
> In case it's helpful to know why I want to draw a representative sample of
> natural language processing papers: I'm interested in what kind of evidence
> our literature provides for a number of characteristics of research in our
> community, ranging from the stances that our community takes on theoretical
> questions to the extent to which our work reflects the characteristics of
> reproducible research.
> The first question that I have is how to define the population that one
> would want to sample. I can think of at least three possibilities, each
> with some strengths and weaknesses.
> 1) The population of publications that are indexed in the ACL Anthology.
> This would have the advantages of being a clearly definable population that
> is presumably a consensus definition of what counts as NLP, and a population
> that contains journals as well as conference papers. The disadvantages of
> searching only the ACL Anthology would be that the sample would contain only
> English-language publications; that I frequently have trouble following
> links on the site, particularly to the bibliographic information, which
> becomes a problem when you want to list and retrieve the documents; and that
> it would reduce the chances of getting a result that generalizes to other
> populations, such as the large number of papers on biomedical natural
> language processing that show up in PubMed/MEDLINE, but not in the ACL
> Anthology. (Presumably the same issue occurs with other topics, e.g.
> digital humanities, which increasingly uses word embeddings.)
> 2) A stratified sample of conferences, such as ACL, EACL, and NAACL. This
> would have the advantage of letting me compare the numbers that I get to
> something else, without which it's difficult to interpret any number (e.g.
> if I sampled three conferences, I could do three chi-square tests with a
> multiple testing correction). The disadvantage would be that as you get
> increasingly specific about which conferences you’re looking at, it becomes
> increasingly difficult to understand what population you’re actually
> sampling (e.g. I have a clear picture of what kind of paper lands in the
> ACL Anthology, but what, if anything, makes something an EACL paper, other
> than reflecting the fact that the meeting took place in Europe?), as well
> as to untangle the potential real effects from random vagaries of the
> individual conferences.
> 3) Google Scholar, which has the advantage that it presumably has broader
> coverage than the ACL Anthology, and the disadvantage that it may be
> difficult to convince reviewers that it’s a reasonable population (versus
> the ACL Anthology).
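
For the "three chi square tests with a multiple testing correction" mentioned under option 2, a minimal sketch might look like the following. It assumes the three tests are pairwise comparisons between conferences of how many sampled papers do or do not have some property of interest, and that the correction is Bonferroni; both are assumptions, since the message does not specify either. The function names and the counts are hypothetical and purely illustrative.

```python
from itertools import combinations
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) on [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability of the chi-square distribution is erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, erfc(sqrt(stat / 2))


def pairwise_tests(counts, alpha=0.05):
    """Run one test per pair of venues and apply a Bonferroni threshold.

    counts maps venue -> (papers with the property, papers without it).
    Returns {(venue1, venue2): (statistic, p_value, significant)}.
    """
    pairs = list(combinations(sorted(counts), 2))
    threshold = alpha / len(pairs)  # Bonferroni: divide alpha by the number of tests
    results = {}
    for v1, v2 in pairs:
        stat, p = chi_square_2x2(*counts[v1], *counts[v2])
        results[(v1, v2)] = (stat, p, p < threshold)
    return results


# Made-up counts for illustration only; these are not real data.
example_counts = {"ACL": (30, 70), "EACL": (45, 55), "NAACL": (32, 68)}
results = pairwise_tests(example_counts)
for pair, (stat, p, significant) in results.items():
    print(pair, round(stat, 3), round(p, 4), significant)
```

With three venues there are exactly three pairs, so each test is compared against alpha / 3 rather than alpha.
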
> Having picked a source to sample, there remains the question of how to
> define the population within the source. Do you sample the entire
> collection, or do you further define the population in terms of specific
> task types—for example, information extraction, named entity recognition,
> machine translation? The advantage of the former is that it’s more
> representative, while the advantage of the latter is that it keeps real
> sources of variability in the things that I’m interested in from being
> swamped/confounded by the relative prevalence of different task types in the
> source.
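
If the population is defined per task type, one way to keep rarer task types from being swamped is to draw a fixed number of papers from each. The sketch below assumes every document already carries a task-type label, which in practice would need its own annotation step; the labels, identifiers, and function name are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_task(labelled_ids, per_task, seed=2017):
    """Draw up to per_task identifiers from each task type.

    labelled_ids is an iterable of (doc_id, task_type) pairs.
    Returns {task_type: [sampled doc_ids]}.
    """
    by_task = defaultdict(set)
    for doc_id, task in labelled_ids:
        by_task[task].add(doc_id)

    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    sample = {}
    for task in sorted(by_task):
        ids = sorted(by_task[task])  # stable order before sampling
        k = min(per_task, len(ids))  # take the whole stratum if it is small
        sample[task] = rng.sample(ids, k)
    return sample


# Placeholder identifiers and labels, not real Anthology data.
labelled = (
    [(f"ie-{i}", "information extraction") for i in range(40)]
    + [(f"ner-{i}", "named entity recognition") for i in range(25)]
    + [(f"mt-{i}", "machine translation") for i in range(5)]
)
strata = sample_per_task(labelled, per_task=10)
```

Equal sample sizes per task type make the strata comparable with each other, at the cost of no longer mirroring the relative prevalence of task types in the source.
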
> Having made a choice of source and a definition of the population(s), I
> still have the problem of how to draw a representative sample, and here I
> have very few ideas. Is it as simple as downloading a list of document
> identifiers and drawing a random sample from it? That actually seems
> reasonable, given that by this point, whatever decisions one made regarding
> the questions above should have yielded a reasonable source for sampling.
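
The "download a list of document identifiers and draw a random sample" step can be sketched as follows, assuming the identifiers are already available as a plain list. The identifiers shown are placeholders, not real Anthology or PubMed IDs, and the function name is hypothetical.

```python
import random


def draw_sample(doc_ids, k, seed=2017):
    """Simple random sample of k identifiers, without replacement."""
    # Deduplicate and sort so the same seed always yields the same sample,
    # regardless of the order in which the identifiers were downloaded.
    unique_ids = sorted(set(doc_ids))
    if k > len(unique_ids):
        raise ValueError("sample size exceeds the number of unique identifiers")
    rng = random.Random(seed)
    return rng.sample(unique_ids, k)


# Placeholder identifiers for illustration only.
doc_ids = [f"doc-{i:04d}" for i in range(1, 501)]
sample = draw_sample(doc_ids, k=50)
```

Recording the seed alongside the identifier list lets someone else redraw exactly the same sample later.
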
> But, then there are some really practical questions that have to be dealt
> with in order to get that list of document identifiers. In particular: of
> the three choices that I identified above (ACL Anthology, PubMed/MEDLINE,
> and Google Scholar), only PubMed/MEDLINE offers the ability to save the
> results of a search, and that’s presumably the least representative of the
> broader NLP community of any of the potential sources that I’ve listed. (As
> far as I know, there’s no API for programmatic search of Google Scholar, and
> there’s not likely to be one for the ACL Anthology any time soon, either.)
> Any insights would be appreciated, and I would be happy to summarize them
> for the list.
> Thank you,
> Kevin Cohen
> --
> Kevin Bretonnel Cohen, PhD
> Director, Biomedical Text Mining Group
> Computational Bioscience Program, U. Colorado School of Medicine
> Chair in Natural Language Processing for the Biomedical Domain
> Université Paris-Saclay, LIMSI-CNRS
> 303-916-2417
> http://compbio.ucdenver.edu/Hunter_lab/Cohen
