Workshop on *'(Semi-)automatic retrieval of data from historical corpora: chances and challenges'* at the 52nd Annual Meeting of the Societas Linguistica Europaea, 21st – 24th August 2019 (Leipzig University, Germany)
Convenors: Marianne Hundt, Melanie Röthlisberger, Gerold Schneider, Eva Zehentner
Developments in historical corpus linguistics have taken a route similar to that of corpus-based research on present-day languages: from the creation of small reference corpora to increasingly larger databases, and from text-only to richly annotated resources. However, historical data have always posed particular challenges for the development of corpus resources, their annotation, and their analysis. Corpus representativeness and balancedness, for instance, have been impaired by the limited availability of texts, particularly for the very early stages of written attestation. Additionally, the highly variable orthography typical of earlier texts has meant that the tools developed for more uniform data cannot be applied in a straightforward manner to historical corpora. In the case of smaller corpora, this has meant that grammatical annotation is done manually or through post-editing. For the increasingly larger resources, however, manual annotation is tedious, and researchers have developed tools for pre-processing like spelling normalisation (Baron and Rayson 2008) and lemmatisation (Burns 2013) to enable automatic tagging and parsing. Matters are complicated further by the fact that a range of different annotated resources exist (*Penn Treebank, Penn Parsed Corpora, Universal Dependency Treebanks*) and that different parsing tools (e.g. Schneider 2012) have been applied to historical corpora. These are likely to require different retrieval strategies, which in turn makes comparisons across corpora difficult. While the list of syntactic parsers is large (e.g. Schneider 2008 for English, Sennrich et al. 2009 for German, van Noord 2006 for Dutch, Alberti et al. 2017 for *Universal Dependency parsing*), few have been used on, or adapted to, historical texts.
The aim of this workshop is to focus on the challenges that (semi-)automatic retrieval of data from historical corpora poses for the study of grammatical change, specifically in English, German, and Dutch. In particular, we invite contributions related (but not limited) to the following topics:
- mapping of different annotation schemes
- evaluation of bottom-up approaches to data retrieval for language
- issues of precision and recall in historical corpora
Ultimately, this workshop seeks to provide a platform for researchers working within these subject areas to exchange ideas and to jointly address the challenges (and chances) we are faced with.
We invite researchers to submit an anonymised abstract of 300 words (excluding references) to *retrievalSLE2019 at gmail.com* by *November 12, 2018*. Talks will be 20 minutes each, with 5 minutes for discussion and 5 minutes for speaker change. The workshop will start with an introduction by the organisers, who will summarise previous research, the research questions addressed in the workshop and the scope of the papers to be presented. The workshop will be concluded with a final discussion.
The workshop proposal to be submitted to the SLE organisers will include all participants’ abstracts. Notification of acceptance/rejection of the workshop proposal by the SLE will be given by December 15, 2018. If our workshop proposal is accepted, we will invite all preliminary workshop participants to submit their full abstracts by January 15, 2019 to the general call for papers for review.
Eva Zehentner Lecturer in English Language and Linguistics Department of Language and Linguistic Science University of York Heslington, York, YO10 5DD