[Corpora-List] 1st CfP: Corpus Analysis with Noise in the Signal (CANS 2013) workshop

Alistair Baron a.baron at comp.lancs.ac.uk
Mon Dec 24 12:54:49 CET 2012

*Call for Papers*

*Workshop: Corpus Analysis with Noise in the Signal (CANS 2013)*

*at Corpus Linguistics 2013 conference (CL2013), Lancaster University, UK.*

*22nd July 2013*

<http://ucrel.lancs.ac.uk/cans2013/>

*Submission deadline: 22nd February 2013*

Whilst many widely-used corpora include mainly standard written text on which a range of automatic corpus analysis and Natural Language Processing (NLP) techniques can be accurately performed, an increasing number of corpora contain substantial amounts of noisy textual data and irregular language. Such corpora range from relatively small specialised historical corpora (e.g. Early Modern English Medical Texts (EMEMT)) and second language learner corpora (e.g. French Learner Language Oral Corpora (FLLOC)) to very large datasets such as the transcribed Early English Books Online collection (EEBO-TCP), large collections of OCRed books (e.g. from Google Books) and the very large corpora being crawled from the web (e.g. from Twitter, and Web as Corpus). These non-standard language varieties can cause significant issues for corpus analysis tools, which in the majority of cases are set up and trained to deal with clean standard texts.

Our response to some of these issues has been the development of a Variant Detector tool (VARD2 <http://ucrel.lancs.ac.uk/vard>). Originally developed to normalize spelling variants within historical English datasets, VARD2 has since been adapted for use with SMS, Twitter, child language, learner corpora, other languages, etc. The purpose of this workshop is to provide a format in which we can discuss, and compare, our approach with other researchers' approaches to noise. This may include work where researchers have used and adapted VARD2, or have utilised new tools and methods.

We invite submissions to present research highlighting the impact of noisy textual data on corpus-based research and/or providing methods to negate the effect of such noise. We are interested in research concerning any corpora with substantial textual noise and are particularly keen to have a range of languages and noise sources represented at the workshop.

Noise sources may include but are not limited to:

- Historical spelling variation

- Computer-mediated language varieties (e.g. chatroom, SMS, social networks, blogs, Twitter, etc.)

- First and second language learner corpora

- Inaccurately digitised texts, e.g. badly OCRed or badly transcribed


- Idiosyncratic language usage/idiolect features

Topics of interest include but are not limited to:

- Evaluations of established corpus analysis methodology when processing noisy corpora.

- Methods for pre-processing noise in corpora, such as spelling normalisation and error correction.

- Development of noise-aware corpus analysis methods which are robust enough to deal with noisy corpora and process them with accuracy, e.g. new automatic part-of-speech taggers.

- Analyses of the characteristics and trends of spelling variation and language irregularities.

- Studies which highlight the importance of maintaining original spellings and language irregularities and how these can assist in some aspects of corpus analysis.

Two types of submissions are sought: full paper presentations and shorter work-in-progress reports. For full papers we require an extended abstract of 1,000-2,000 words. For work-in-progress reports we require shorter abstracts of 500-1,000 words. The deadline for submitting abstracts is *22nd February 2013*. Abstracts will then be reviewed by the organising committee, and you will receive a response by 11th March 2013. The organising committee consists of:

- Alistair Baron (Lancaster University)

- Paul Rayson (Lancaster University)

- Dawn Archer (University of Central Lancashire)

Papers should be submitted to cans2013 at comp.lancs.ac.uk, and should use the same guidelines and template as those for the main Corpus Linguistics 2013 conference, with the exception of text length restrictions. Further instructions for submission are provided on the workshop website <http://ucrel.lancs.ac.uk/cans2013/#submission>.

Accepted full papers will be allocated 20 minutes + 5 minutes for questions; accepted work-in-progress reports will be allocated 10 minutes + 5 minutes for questions. The remaining time will include an open discussion of the papers presented and general topics such as:

- What are the key challenges of dealing with noisy textual data going


- When should we leave "noise" where it is? And for what reason(s)?

- What are the dangers of ignoring the noise?

We expect to select papers from the workshop for a peer-reviewed journal special issue.

*In line with the policy of the conference organisers, you are welcome to submit abstracts both for this workshop and for the main Corpus Linguistics 2013 conference. However, if you give two papers they should be different, without substantial overlap.*

--
Dr. Alistair Baron
Faculty Research Fellow
Security Lancaster
School of Computing and Communications
Infolab21
Lancaster University
Lancaster
LA1 4WA
UK

O: B61 Infolab21
T: +44 (0)1524 510519 (temp.)
E: a.baron at lancs.ac.uk <a.baron at comp.lancs.ac.uk>
W: http://www.comp.lancs.ac.uk/~barona
