Hi All,

a very interesting discussion. A few additional notes, largely from a UK perspective:

1.) The Linguistic Society of America has passed a resolution urging that research outputs such as corpora be viewed as scholarly outputs in their own right: https://www.linguisticsociety.org/resource/resolution-recognizing-scholarly-merit-language-documentation
2.) In the UK, our national research assessment exercise allows corpora to be submitted. Any corpora submitted are assessed on a par with other research outputs, e.g. journal articles, books, etc.
3.) The UK research councils certainly view corpora as research outputs and require that those produced with their support are duly catalogued and reported.
4.) In terms of peer review, when seeking support for corpus construction from funders, the plans and need for a corpus are assessed through peer review, i.e. you make a grant application and that gets assessed. Likewise, at the end of a grant the corpus itself may be subject to review, though end-of-grant reviews fluctuate over time in the UK between being more and less formal.
5.) As Mark notes, it is often the case that people write an account of a corpus and the decisions made in building it. That output (e.g. a journal article) is, of course, peer reviewed.
6.) I think that archives such as ELRA and LDC do check the corpora that they distribute, though those checks tend to be formal rather than conceptual in my experience, e.g. if the corpus uses XML, they check that the XML parses.

Of course, many corpora are also, in effect, given a post-publication peer review, i.e. people use, edit and critique that data.

On a separate, though related, note the issue with promotions committees that Mark notes also applies, at times, to researchers who produce software packages. I suspect the dynamic there is very similar to that discussed already, with certain disciplines being more open to recognising software as a research output than others. Points 3-5 above (at least) apply to producers of software as much as they do to producers of corpora.

I am very much enjoying this conversation. I work primarily between the fields of Science of Science and the Semantic Web, mostly bringing Semantic Web technologies to bear on Science of Science research. With regard to this conversation, at the International Semantic Web Conference, which is the premier conference of the field, we have the "Resource track", where we can submit resources such as datasets, ontologies, vocabularies, software and others. The process is identical to that of any other track: submissions go through peer review. Indeed, one of my most cited papers is "The computer science ontology: a large-scale taxonomy of research areas", which describes the largest ontology of research topics in the field of Computer Science and was published through the track I mentioned above.

"In accordance with Open Science principles, research papers may also be in the form of data papers and software papers (short or long papers). The former present the motivation and methodology behind the creation of data sets that are of value to the community; e.g., annotated corpora, benchmark collections, training sets. The latter present software functionality, its value for the community, and its application to a non-specialist reader. To enable reproducibility and peer-review, authors will be requested to share the DOIs of the data sets and the software products described in the articles and thoroughly describe their construction and reuse."

On Tue, 29 Dec 2020 at 03:45, Khurshid Ahmad <kahmad at scss.tcd.ie> wrote:

Dear Hugh,

The 'peer review' is very important: one measure of the impact of your scholarship in this digital era is the number of downloads your corpus/corpora has/have. As Mark has rightly suggested, computer science folks are more receptive to this idea. You might enter the downloads as a measure of the esteem in which your colleagues hold your work. The number of hits is a key measure of ranking employed by search engines, and in Google ranking the 'fancy hits', the number of people looking up to your website, is critical for higher ranking. In another domain, mass communications, the downloads may indicate your reputation.

I am not much in favour of the so-called are journal publications. In some branches of engineering and physics, the publication of your research in a 'letters' or rapid communications journal is regarded more highly, and in yet other disciplines a monograph is essential.

Whatever happens please keep up the good work and promote data driven research.

>>> Are there alternative models (and/or vocabulary) being used for
> discussing how the compilation of a corpus is part of one's scientific
> output?
> I've created a number of corpora [1] that have been widely used by
> researchers. But I've worked in a College of Humanities where rank and
> status committees are typically dominated by people in literary and
> cultural studies, where the only thing they really understand is the
> all-important journal article. (Even peer-reviewed conference papers
> are usually suspect in their eyes.) And they would never understand,
> for example, data from something like Google Analytics, which provides
> concrete data on the number of people actually using the corpora [2],
> or the number of citations in Google Scholar [3].
> So for each of the corpora that I've created, I've tried to make sure
> that I do have journal articles [4] that explain the creation and use
> of the corpora. Of course if you're in a college that includes
> computer science, for example, they will probably be more open-minded
> to the intrinsic value of creating corpora / large datasets that are
> widely used by other researchers.
> Greetings,
> Peer-reviewed publication is an important part of academic advancement
> in many job situations. I am not seeing any discussion in the
> literature on how corpora are being "peer-reviewed" (I'm using Google
> Scholar). Are there alternative models (and/or vocabulary) being used
> for discussing how the compilation of a corpus is part of one's
> scientific output? Any recent papers on this issue? I see some recent
> literature discussion on data citation, and software citation, but
> these don't address the peer-review aspect, and don't specifically
> address corpora.
> [1] https://www.english-corpora.org/
> [2] https://www.english-corpora.org/users.asp
> [3] https://scholar.google.com/citations?user=8-LRgUIAAAAJ&hl=en&oi=ao
> [4] https://www.mark-davies.info/vita.pdf
