I would opt for options (1) and (2) at the same time.
I think Min-Yen Kan's suggestion to use the already compiled corpus of ACL anthology papers will make your work a lot simpler.
At the same time, as you suggested, I feel that stratified sampling may yield many interesting results.
For example, the following may yield interesting comparisons:
- ACL (as a study corpus) vs. ACL anthology (as a reference corpus)
- LREC (as a study corpus) vs. ACL anthology (as a reference corpus)
I once chaired the "Language Resources" section of ACL, and many reviewers left comments like "LREC would be a better venue for this manuscript".
The comparisons above might be able to explain the difference between the two conferences in a descriptive way.
I can also imagine other interesting comparisons, such as:
- Biomedical NLP (as a study corpus) vs. ACL/NAACL/EACL (as a reference corpus)
- or even ACL anthology (as a study corpus) vs. Google Scholar (as a reference corpus)
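To make the study-corpus vs. reference-corpus idea concrete, here is a minimal sketch of one common way to run such a comparison: ranking words by log-likelihood keyness. This is only an illustration, not a description of anyone's actual pipeline. The function names (log_likelihood, keywords) are hypothetical, and I assume both corpora are already tokenized into flat lists of lowercase words.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score for a single word.

    a: frequency of the word in the study corpus
    b: frequency of the word in the reference corpus
    c: total number of tokens in the study corpus
    d: total number of tokens in the reference corpus
    """
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens, top_n=10):
    """Return the top_n words that are overrepresented in the study corpus."""
    study = Counter(study_tokens)
    ref = Counter(reference_tokens)
    c = sum(study.values())
    d = sum(ref.values())
    scored = []
    for word, a in study.items():
        b = ref.get(word, 0)
        # Keep only words that are relatively more frequent in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Toy data standing in for, e.g., LREC (study) vs. ACL anthology (reference).
    study = ["corpus"] * 50 + ["the"] * 50
    reference = ["the"] * 50 + ["model"] * 50
    for word, score in keywords(study, reference):
        print(word, round(score, 2))
```

In practice one would probably also want a minimum frequency cutoff and some normalization of tokens, but the ranking itself is the descriptive comparison I had in mind.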
On Wed, Apr 12, 2017 at 8:56 PM, #KOKIL JAIDKA# <KOKI0001 at e.ntu.edu.sg> wrote:
> Hi Kevin
> With reference to ACL, it is possible to programmatically download a list of
> papers using a few lines of Python code and there is no need for an API. One
> would expect you'd have a list of the paper identifiers you finally want. We
> followed a set of steps quite similar to the ones you've outlined to create
> our CL-SciSumm corpus. I'd be happy to link you to our resources if needed.
> Kokil Jaidka
> Postdoctoral researcher
> University of Pennsylvania