For those who are interested, this is a summary of the replies to my request below - many thanks to the contributors.
Adam Kilgarriff pointed to the paper “Comparing Corpora <http://kilgarriff.co.uk/Publications/2001-K-CompCorpIJCL.pdf>” (International Journal of Corpus Linguistics 2001 6 (1): 1-37), to recent work in the “web as corpus” community that also addresses more general questions of comparison between language varieties (e.g. Ferraresi et al.: “Introducing and evaluating ukWaC, a very large web-derived corpus of English”), and to the Sketch Engine, “which supports ‘keyword’ analyses between a subcorpus and the rest of the corpus it is part of.”
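For anyone who wants to experiment before reaching for a full tool, here is a minimal sketch of the kind of keyword computation discussed in “Comparing Corpora”, using Dunning’s log-likelihood statistic. The function names and the toy corpora are my own; this is an illustration, not the implementation used by the Sketch Engine.

```python
from collections import Counter
from math import log

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning log-likelihood for one word across two corpora."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * log(freq_b / expected_b)
    return 2 * ll

def keywords(corpus_a, corpus_b, top_n=10):
    """Rank words most distinctive of corpus_a relative to corpus_b."""
    fa, fb = Counter(corpus_a), Counter(corpus_b)
    na, nb = len(corpus_a), len(corpus_b)
    scored = []
    for w in set(fa) | set(fb):
        # keep only words over-represented (relatively) in corpus_a
        if fa[w] / na > fb[w] / nb:
            scored.append((log_likelihood(fa[w], na, fb[w], nb), w))
    return [w for _, w in sorted(scored, reverse=True)[:top_n]]
```

The same function applied to two variety corpora (tokenised into word lists) gives a first, purely lexical ranking of what distinguishes them.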
Ana Frankenberg sent information on the COMPARA parallel corpus of Portuguese “with different varieties of Portuguese and English and with a complex search facility which allows users to compare and contrast different varieties of these two languages.”
Helen Johnson sent a link to a paper with “ideas of things to look at in the comparison of varieties”, which compares English written by non-native speakers in academic texts and catalogues the variation by L1 language group: “The Way We Write”, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1319188
Michal Křen pointed to their recent study on comparing diachronic corpora at the lexical level, presented at Euralex 2008 (“Corpus as a Means for Study of Lexical Usage Changes” by Michal Křen and Jaroslava Hlaváčová), and especially to their “lesson learned: the necessity to obtain the highest possible comparability of the base corpora; otherwise you can end up with lots of garbage, as differences in corpus composition can prove more significant than the linguistic differences one wants to study. This may be why there are probably no automated tools of this kind for higher levels of language description. You may also find interesting a paper by Jörg Asmussen on a very similar topic; it is cited in the references section of our paper.”
I would still be grateful for further hints, also on system-independent corpus comparison approaches or tools, if any exist.
Thanks and best regards, Stefanie
From: Anstein Stefanie Sent: Tuesday, 15 July, 2008 11:41 To: 'corpora at uib.no' Subject: comparison of language varieties
This is a general query about comparing language variety corpora
following Asim’s questions (see below).
I am looking for any automated corpus studies and tools
for comparing the varieties of a language,
in order to take them as a basis for further research
on the development of tools for the systematic and automated
comparison of linguistic varieties on the basis of text corpora.
Up to now I have contacted researchers of several variety corpus projects,
e.g. the ‘International Corpus of English’ ICE,
the ‘Trésor de la Langue Française informatisé’ TLFi, or
the ‘Proyecto para el Estudio Sociolingüístico del Español de España y América’ PRESEA.
I was pointed to semi-automatic studies on the lexical level,
e.g. at the Centro de Linguística da Universidade de Lisboa (CLUL).
As far as I can see now, there have not been any publications
on automated comparison tools for higher levels of linguistic description,
e.g. on collocations, syntactic differences or even on the textual level.
So I’d appreciate references to such studies, starting from the lexical level.
In addition, I’d be grateful for any other ideas on contrasting ‘similar’ corpora / data sets,
which might also come from quite different research fields.
I will post a summary with the replies I get.
Thank you for any kind of hint,
Institute for Specialised Communication and Multilingualism
Viale Druso 1, I-39100 Bolzano
t +39 0471 055 135
f +39 0471 055 199
stefanie.anstein at eurac.edu
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Asim Sent: Tuesday, 27 May, 2008 19:41 To: corpora at uib.no Subject: [Corpora-List] request for parsing and making the data in a form to be used by wordsmith
I am working on Pakistani English. I have compiled a 2.1-million-word corpus of written Pakistani English. It is the first-ever corpus of Pakistani English.
I want to study the features of the Pakistani variety of English. Could anyone tell me how to locate them? Any suggestions would be welcome.
I have tagged the corpus and am now trying to analyse it using both top-down and bottom-up approaches.
I want to study verb particles, and for this I want to parse the data, as I think parsing is the only way to confirm whether a given word is a preposition or a particle. If there is any other way besides manual inspection, please tell me; I would be obliged.
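Since the corpus is already POS-tagged, one cheap first pass (before full parsing) is a tag-pattern heuristic. The sketch below is illustrative only and assumes a CLAWS-style word_TAG format in which verb tags begin with “V”, particles are tagged “RP”, and preposition tags begin with “II”; adjust the tag tests to whatever tagset was actually used.

```python
import re

# Matches tokens of the form word_TAG (CLAWS-style vertical-bar-free output).
TOKEN = re.compile(r"(\S+)_(\S+)")

def verb_particle_candidates(tagged_text, window=2):
    """Collect (verb, particle) pairs where a particle-tagged word
    follows a verb within `window` tokens."""
    tokens = TOKEN.findall(tagged_text)
    hits = []
    for i, (word, tag) in enumerate(tokens):
        if tag.startswith("V"):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                w2, t2 = tokens[j]
                if t2 == "RP":            # tagged as a verb particle
                    hits.append((word, w2))
                    break
                if t2.startswith("II"):   # a preposition ends the search
                    break
    return hits
```

The candidate pairs can then be written out as a plain-text word list or concordance input, which a tool like WordSmith can read directly.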
Another issue: when I use online demo parsers such as an LFG parser, how can I store the results in a form that WordSmith 4 can use, so that I can locate all the particles in my data?
Is there any solution?
I hope to hear from you soon.