[Corpora-List] A question about corpus linguistics with relevance to natural language processing: how to sample a population of papers about NLP?

Kevin B. Cohen kevin.cohen at gmail.com
Thu Apr 6 08:14:08 CEST 2017


Hello, all,

I have a question somewhere in the intersection between corpus linguistics and natural language processing. I'd like to do a descriptive study of natural language processing papers, but I have no idea how to draw a representative sample of them.

In case it's helpful to know why I want to draw a representative sample of natural language processing papers: I'm interested in what kind of evidence our literature provides for a number of characteristics of research in our community, ranging from the stances that our community takes on theoretical questions to the extent to which our work reflects the characteristics of reproducible research.

The first question that I have is how to define the population that one would want to sample. I can think of at least three possibilities, each with some strengths and weaknesses.

1) The population of publications that are indexed in the ACL Anthology. This would have the advantages of being a clearly definable population that is presumably a consensus definition of what counts as NLP, and a population that contains journal articles as well as conference papers. The disadvantages of searching only the ACL Anthology would be that the sample would contain only English-language publications; that I frequently have trouble following links on the site, particularly to the bibliographic information, which becomes a problem when you want to list and retrieve the documents; and that it would reduce the chances of getting a result that generalizes to other populations, such as the large number of papers on biomedical natural language processing that show up in PubMed/MEDLINE, but not in the ACL Anthology. (Presumably the same issue occurs with other topics, e.g. digital humanities, which increasingly uses word embeddings.)

2) A stratified sample of conferences, such as ACL, EACL, and NAACL. This would have the advantage of letting me compare the numbers that I get to something else, without which it's difficult to interpret any number (e.g. if I sampled three conferences, I could do three chi-square tests with a multiple testing correction; see the sketch after this list). The disadvantage would be that as you get increasingly specific about which conferences you're looking at, it becomes increasingly difficult to understand what population you're actually sampling (e.g. I have a clear picture of what kind of paper lands in the ACL Anthology, but what, if anything, makes something an EACL paper, other than the fact that the meeting took place in Europe?), and to untangle real effects from the random vagaries of individual conferences.

3) Google Scholar, which has the advantage that it presumably has broader coverage than the ACL Anthology, and the disadvantage that it may be difficult to convince reviewers that it’s a reasonable population (versus the ACL Anthology).
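
To make the comparison in (2) concrete, here is a rough sketch of those pairwise tests in Python. The counts are entirely hypothetical, it assumes SciPy, and Bonferroni is only one possible multiple-testing correction:

    from itertools import combinations
    from scipy.stats import chi2_contingency

    # Entirely hypothetical counts: (papers with some property, papers
    # without it) per conference, e.g. "reports the software version used".
    counts = {
        "ACL":   (42, 158),
        "EACL":  (18, 82),
        "NAACL": (30, 120),
    }

    pairs = list(combinations(counts, 2))
    alpha = 0.05 / len(pairs)  # Bonferroni-corrected significance threshold

    for a, b in pairs:
        table = [counts[a], counts[b]]  # 2x2 contingency table
        chi2, p, _dof, _expected = chi2_contingency(table)
        verdict = "significant" if p < alpha else "not significant"
        print("%s vs. %s: chi2=%.2f, p=%.4f (%s at corrected alpha=%.4f)"
              % (a, b, chi2, p, verdict, alpha))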

Having picked a source to sample, there remains the question of how to define the population within the source. Do you sample the entire collection, or do you further define the population in terms of specific task types, such as information extraction, named entity recognition, or machine translation? The advantage of the former is that it's more representative, while the advantage of the latter is that it keeps real sources of variability in the things that I'm interested in from being swamped or confounded by the relative prevalence of different task types in the source.
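
The latter option amounts to stratified random sampling over task types. A rough sketch in Python, with entirely made-up identifier lists and an equal allocation per stratum:

    import random

    random.seed(0)  # fixed seed, in the spirit of reproducibility

    # Made-up identifier lists standing in for per-task strata.
    strata = {
        "information extraction":   ["IE-%d" % i for i in range(500)],
        "named entity recognition": ["NER-%d" % i for i in range(300)],
        "machine translation":      ["MT-%d" % i for i in range(800)],
    }

    # Equal allocation per stratum, so that no task type's sheer volume
    # swamps the others in the final sample.
    n_per_stratum = 50
    sample = {task: random.sample(ids, n_per_stratum)
              for task, ids in strata.items()}

    for task, picked in sample.items():
        print(task, len(picked), picked[:3])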

Having made a choice of source and a definition of the population(s), I still have the problem of how to draw a representative sample, and here I have very few ideas. Is it as simple as downloading a list of document identifiers and drawing a random sample from it? That actually seems reasonable, given that by this point, whatever decisions one made regarding the questions above should have yielded a reasonable source for sampling. But then there are some really practical questions that have to be dealt with in order to get that list of document identifiers. In particular: of the three sources that I've mentioned (the ACL Anthology, PubMed/MEDLINE, and Google Scholar), only PubMed/MEDLINE offers the ability to save the results of a search, and that's presumably the least representative of the broader NLP community of any of the potential sources that I've listed. (As far as I know, there's no API for programmatic search of Google Scholar, and there's not likely to be one for the ACL Anthology any time soon, either.)
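
For PubMed/MEDLINE, at least, the identifier-list approach seems workable, since NCBI's E-utilities do support programmatic search. A rough sketch in Python, using only the standard library; the query string is just an illustration:

    import json
    import random
    import urllib.parse
    import urllib.request

    # Fetch PMIDs matching a query via NCBI's esearch endpoint, then
    # draw a simple random sample from the returned identifier list.
    query = '"natural language processing"[Title/Abstract]'
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": query,
        # One call returns at most retmax IDs; for very large result
        # sets, see the E-utilities docs (retstart, history server).
        "retmax": 10000,
        "retmode": "json",
    })
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params

    with urllib.request.urlopen(url) as response:
        result = json.load(response)

    pmids = result["esearchresult"]["idlist"]
    random.seed(42)  # record the seed so the sample itself is reproducible
    sample = random.sample(pmids, min(200, len(pmids)))
    print(len(pmids), "identifiers retrieved; first five sampled:", sample[:5])

(Recording the seed means the sample can be redrawn exactly, which seems only fitting for a study that is partly about reproducibility.)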

Any insights would be appreciated, and I would be happy to summarize them for the list.

Thank you,

Kevin Cohen

--
Kevin Bretonnel Cohen, PhD
Director, Biomedical Text Mining Group
Computational Bioscience Program, U. Colorado School of Medicine
Chair in Natural Language Processing for the Biomedical Domain
Université Paris-Saclay, LIMSI-CNRS
303-916-2417
http://compbio.ucdenver.edu/Hunter_lab/Cohen


