[Corpora-List] A question about corpus linguistics with relevance to natural language processing: how to sample a population of papers about NLP?

Jin-Dong Kim jdkim at dbcls.rois.ac.jp
Thu Apr 13 10:41:20 CEST 2017

Hi Kevin,

I would opt for options (1) and (2) at the same time.

I think Min-Yen Kan's suggestion to use the already compiled corpus of ACL anthology papers will make your work a lot simpler.

At the same time, as you suggested, stratified sampling may yield lots of interesting results, I feel.

For example,
- ACL (as a study corpus) vs. ACL anthology (as a reference corpus),
- LREC (as a study corpus) vs. ACL anthology (as a reference corpus)

may yield interesting comparisons.

I once chaired the "Language Resources" section of ACL, and many reviewers left comments like "LREC would be a better venue for this manuscript".

The above comparisons might explain the differences between the conferences in a descriptive way.
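A minimal sketch of how a study-corpus vs. reference-corpus comparison could be computed. It assumes the log-likelihood (G2) keyness statistic, which is a common choice in corpus linguistics but is not named in this message. The function names `log_likelihood` and `keywords` are hypothetical, and the token lists are assumed to be produced elsewhere.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 keyness score for one word.

    a: frequency in the study corpus, b: frequency in the reference corpus,
    c: total tokens in the study corpus, d: total tokens in the reference corpus.
    """
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    # Terms with zero observed frequency contribute nothing (0 * ln 0 := 0).
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(study_tokens, reference_tokens, top_n=10):
    """Return the top_n words over-represented in the study corpus."""
    study = Counter(study_tokens)
    ref = Counter(reference_tokens)
    c, d = sum(study.values()), sum(ref.values())
    if c == 0 or d == 0:
        raise ValueError("both corpora must contain at least one token")
    scored = []
    for word, a in study.items():
        b = ref.get(word, 0)
        # Keep only words that are relatively more frequent in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

For instance, `keywords(lrec_tokens, anthology_tokens)` would list the words that most distinguish LREC papers from the anthology as a whole, where both token lists are placeholders for whatever tokenization is used.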

I can also imagine other interesting comparisons, like
- Biomedical NLP (as a study corpus) vs. ACL/NAACL/EACL (as a reference corpus), or even
- ACL anthology (as a study corpus) vs. Google Scholar (as a reference corpus).
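The stratified sampling mentioned above could be sketched roughly as follows. This assumes each paper is a record carrying a stratum label such as its venue. The function name `stratified_sample`, the `venue` field, and the fixed per-stratum quota are illustrative assumptions, not something specified in this thread.

```python
import random
from collections import defaultdict


def stratified_sample(papers, key, per_stratum, seed=0):
    """Draw up to per_stratum papers at random from each stratum.

    papers: iterable of paper records; key: function mapping a paper to
    its stratum label (e.g. venue); seed: fixed for reproducibility.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for paper in papers:
        strata[key(paper)].append(paper)
    sample = {}
    # Iterate strata in sorted order so results are reproducible for a given seed.
    for name in sorted(strata):
        members = strata[name]
        # A small stratum contributes all of its papers.
        k = min(per_stratum, len(members))
        sample[name] = rng.sample(members, k)
    return sample
```

Calling `stratified_sample(papers, key=lambda p: p["venue"], per_stratum=50)` would give one sub-corpus per venue, each of which could then serve as a study corpus.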



On Wed, Apr 12, 2017 at 8:56 PM, #KOKIL JAIDKA# <KOKI0001 at e.ntu.edu.sg> wrote:
> Hi Kevin
> With reference to ACL, it is possible to programmatically download a list of
> papers using a few lines of Python code and there is no need for an API. One
> would expect you'd have a list of the paper identifiers you finally want. We
> followed a set of steps quite similar to the ones you've outlined to create
> our CL-SciSumm corpus. I'd be happy to link you to our resources if needed.
> Regards
> Kokil Jaidka
> Postdoctoral researcher
> University of Pennsylvania
