Workshop: Corpus Analysis with Noise in the Signal (CANS 2013) at Corpus Linguistics 2013 conference (CL2013), Lancaster University, UK. 22nd July 2013
Extended Submission deadline: 1st March 2013
Whilst many widely-used corpora include mainly standard written text on which a range of automatic corpus analysis and Natural Language Processing (NLP) techniques can be accurately performed, an increasing number of corpora contain substantial amounts of noisy textual data and irregular language. Such corpora range from relatively small specialised historical corpora (e.g. Early Modern English Medical Texts (EMEMT)) and second language learner corpora (e.g. French Learner Language Oral Corpora (FLLOC)) to very large datasets such as the transcribed Early English Books Online collection (EEBO-TCP), large collections of OCRed books (e.g. from Google Books) and the very large corpora being crawled from the web (e.g. from Twitter, and Web as Corpus). These non-standard language varieties can cause significant issues for corpus analysis tools, which in the majority of cases are set up and trained to deal with clean standard texts.
Our response to some of these issues has been the development of a Variant Detector tool (VARD2<http://ucrel.lancs.ac.uk/vard>). Originally developed to normalize spelling variants within historical English datasets, VARD2 has since been adapted for use with SMS, Twitter, child language, learner corpora, other languages, etc. The purpose of this workshop is to provide a format in which we can discuss - and compare - our approach with other researchers' approaches to noise. This may include work where researchers have used and adapted VARD2, or have utilised new tools and methods.
We invite submissions to present research highlighting the impact of noisy textual data on corpus-based research and/or providing methods to negate the effect of such noise. We are interested in research concerning any corpora with substantial textual noise and are particularly keen to have a range of languages and noise sources represented at the workshop.
Noise sources may include but are not limited to:
* Historical spelling variation
* Computer-mediated language varieties (e.g. chatroom, SMS, social networks, blogs, Twitter, etc.)
* First and second language learner corpora
* Inaccurately digitised texts, e.g. badly OCRed or badly transcribed corpora
* Idiosyncratic language usage/idiolect features
Topics of interest include but are not limited to:
* Evaluations of established corpus analysis methodology when processing noisy corpora.
* Methods for pre-processing noise in corpora, such as spelling normalisation and error correction.
* Development of noise-aware corpus analysis methods which are robust enough to deal with noisy corpora and process them with accuracy, e.g. new automatic part-of-speech taggers.
* Analyses of the characteristics and trends of spelling variation and language irregularities.
* Studies which highlight the importance of maintaining original spellings and language irregularities and how these can assist in some aspects of corpus analysis.
Two types of submission are sought: full paper presentations and shorter work-in-progress reports. For full papers we require an extended abstract of 1,000-2,000 words. For work-in-progress reports we require shorter abstracts of 500-1,000 words. The deadline for submitting abstracts is 1st March 2013. Abstracts will then be reviewed by the organising committee, and you will receive a response by 11th March 2013.
The organising committee consists of:
* Alistair Baron (Lancaster University)
* Paul Rayson (Lancaster University)
* Dawn Archer (University of Central Lancashire)
Papers should be submitted to cans2013 at comp.lancs.ac.uk<mailto:cans2013 at comp.lancs.ac.uk>, and should use the same guidelines and template as those for the main Corpus Linguistics 2013 conference, with the exception of text length restrictions. Further instructions for submission are provided on the workshop website<http://ucrel.lancs.ac.uk/cans2013/#submission>.
Accepted full papers will be allocated 20 minutes + 5 minutes for questions; accepted work-in-progress reports will be allocated 10 minutes + 5 minutes for questions. The remaining time will include an open discussion of the papers presented and general topics such as:
* What are the key challenges of dealing with noisy textual data going forward?
* When should we leave "noise" where it is? And for what reason(s)?
* What are the dangers of ignoring the noise?
We expect to select papers from the workshop for a peer-reviewed journal special issue.
In line with the policy of the conference organisers, you are welcome to submit abstracts both for this workshop and for the main Corpus Linguistics 2013 conference. However, if you give two papers they should be different, without substantial overlap.
Dr. Paul Rayson
Director of UCREL and Senior Lecturer in Computer Science
Faculty of Science and Technology Director of International Teaching Partnerships
School of Computing and Communications, Infolab21, Lancaster University, Lancaster, LA1 4WA, UK.
Web: http://www.comp.lancs.ac.uk/~paul/
Tel: +44 1524 510357
Fax: +44 1524 510492