[Corpora-List] corpora of grammatical errors

Anabela Barreiro barreiro_anabela at hotmail.com
Mon Apr 16 13:43:01 CEST 2012


This is funny and for fun... :) I (unshamefully) do not properly restrain in self assertion of being a good proficient second language writter (even near-native writter, in my most egocentric moments ;) ) This is an example of how my previous e-mail to the list would be corrected/improved by a linguistically-sophisticated grammar checker (smarter than I was when I wrote my message):

-----------------

Dear Corpora-List Members,

I would like to thank all who have sent me personal e-mails with suggestions, including indication on where to find corpora for languages other than English and the Romance languages.

In reply to Ramesh,

I would say that they all contain sentences with grammatical errors. I am interested in corpora with sentences that have errors demonstrating particular aspects of the grammar (prepositions, verb tenses, negation, coordination, etc., etc., etc.) with some pre-selection and pre-categorization of the ungrammaticality of the sentences. In the past, system developers used what were called "test suites", mostly fabricated by linguists for the specific purpose of testing a particular system, which included files with ungrammatical sentences. I am interested in sentences that come from "real" usage of language by non-native speakers, and from native speakers with writing difficulties or writing texts where language and style is not optimized and needs to be improved. When supporting editing of a text, existing grammar checkers are not sophisticated enough to identify all the grammar problems and often identify as a problem perfectly correct sentences (false positives and false negatives). In addition to correction, there is also the potential for providing better solutions for writing (including more categories to the typology)... For example, I can fix support verb constructions with "weak" verbs into semantically "strong" verbs, which gives the text a more professional style, eliminates words that are unnecessary, helps texts being translated more efficiently by humans and machines, etc.


From my request on this list, I found out that there is an ongoing shared task concerned with the automated correction of errors in text by Robert Dale and Adam Kilgarriff:
http://clt.mq.edu.au/research/projects/hoo/

This is an especially interesting task because it groups errors into linguistic categories. Hoo already includes preposition and determiner errors in exam scripts authored by learners of English as a Second Language, but their goal is to enlarge the typology of linguistic errors. That's all I wished for :)

-----------------

Have a good day!

Anabela.

From: barreiro_anabela at hotmail.com
To: r.krishnamurthy at aston.ac.uk
CC: corpora at uib.no
Subject: RE: corpora of grammatical errors
Date: Mon, 16 Apr 2012 10:33:42 +0000

Dear Corpora-List Members,

I would like to thank all who have sent me individual e-mails with suggestions, including indication on where to find corpora for languages other than English and the Romance languages.

In reply to Ramesh,

I would say that they all contain sentences with grammatical errors. I am interested in corpora where all sentences have errors on particular aspects of the grammar (prepositions, verb tenses, negation, coordination, etc., etc., etc.) with some pre-selection and pre-categorization of the ungrammaticality of the sentences. In the past, system developers used what was called "test suites", mostly fabricated by linguists for the specific purpose of testing a particular system. I am interested in sentences that come from "real" usage of language by non-native speakers, but also native speakers with writing difficulties or writing texts where language and style is not optimized or could be improved. When supporting editing of a text, existing grammar checkers are not sophisticated enough to identify all the grammar problems and often identify as a problem perfectly correct sentences (false positives and false negatives). In addition to correction, there is also the potential for providing better solutions for writing (including more categories to the typology)... For example, I can fix support verb constructions with "weak" verbs into semantically "strong" verbs, which gives the text a more professional style, eliminates words that are unecessary, helps texts being translated more efficiently by humans and machines, etc.


From my request on this list, I found out that there is an ongoing shared task concerned with the automated correction of errors in text by Robert Dale and Adam Kilgarriff :
http://clt.mq.edu.au/research/projects/hoo/

This is a especially interesting task because it groups errors into linguistic categories. Hoo already includes preposition and determiner errors in exam scripts authored by learners of English as a Second Language, but their goal is to enlarge the typology of linguistic errors. That's all I wished for :)

Thank you all,

Anabela

-------------------------------------------------------------------------------------------------
Think GREEN - Act GREEN!

Anabela M. Barreiro
Personal webpage: https://www.l2f.inesc-id.pt/wiki/index.php/Anabela_Barreiro
LinkedIn: http://www.linkedin.com/in/anabelabarreiro
-------------------------------------------------------------------------------------------------

From: r.krishnamurthy at aston.ac.uk
To: barreiro_anabela at hotmail.com
CC: corpora at uib.no
Subject: corpora of grammatical errors
Date: Sun, 15 Apr 2012 12:42:20 +0000

Hi Anabela

#1 Do ALL the currently available public corpora not ‘contain sentences with grammatical errors’? Very few (if any) texts will be 100% grammatically ‘correct’ (whichever model of grammar you use)? So BNC, COCA, etc should be OK for you? But the specific ‘errors’ your system identifies will of course depend on your choice of model.

#2 If you want a corpus with a high proportion of ‘errors’, would any available LANGUAGE LEARNER, NON-NATIVE-SPEAKER, NON-STANDARD, or VARIETAL corpus be sufficient for your purposes? These corpora should be easy to find via Google, by specifying one of those attributes?

Hope this helps
Ramesh

Ramesh Krishnamurthy
Visiting Academic Fellow, School of Languages and Social Sciences, Aston University, Birmingham B4 7ET

Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/
Corpus Analyst:
(a) GeWiss (Volkswagen Foundation) project: http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/
(b) Discourse of Climate Change: http://www1.aston.ac.uk/lss/research/research-projects/discourse-of-climate-change-project/
(c) Feminism: http://acorn.aston.ac.uk/projects.html
(d) COMENEGO (Corpus Multilingüe de Economía y Negocios) - Multilingual Corpus of Business and Economics: http://dti.ua.es/comenego
(e) European Phraseology Project: http://labidiomas3.ua.es/phraseology/login/login.php
-------------------------------------------------------------------------------------------------------------------------

Date: Sat, 14 Apr 2012 10:24:50 +0000
From: Anabela Barreiro <barreiro_anabela at hotmail.com>
Subject: [Corpora-List] corpora of grammatical errors
To: "corpora at uib.no" <corpora at uib.no>

Dear Corpora List Members,

I am looking for public corpora containing sentences with grammatical errors.

I plan to use the corpora as input to grammar checking and correction routines.

The corpora can be in English or Romance languages. I appreciate any indication of where I can find those corpora. Thank you!
