[Corpora-List] Learner corpora build & query tool?

Jarmo Jantunen Jarmo.Jantunen at oulu.fi
Wed Feb 25 08:25:05 CET 2009

Dear Simon,

Perhaps you would like to have a look at the ICLFI Corpus - the International Corpus of Learner Finnish that is collected at the University of Oulu in Finland. That corpus also includes a subcorpus of learner Finnish produced by Chinese language learners. The size of the total corpus is approximately 320 000 tokens at the moment.

The description of the ICLFI Corpus can be found at the web page http://www.oulu.fi/hutk/sutvi/oppijankieli/ICLFI_Corpus.html

Best wishes,

Jarmo Harri Jantunen

Adjunct professor, senior lecturer

Finnish as a Second and Foreign Language
Faculty of Humanities
P.O. Box 1000
FI-90014 University of Oulu
Finland
Tel. +358 8 553 3478
http://www.oulu.fi/hutk/sutvi/henkilokunta/jjantunen.html
http://www.oulu.fi/hutk/sutvi/oppijankieli


From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of simon smith
Sent: 24 February 2009 7:20
To: CORPORA at uib.no
Subject: [Corpora-List] Learner corpora build & query tool?

I've been looking over the resources recommended to Mieke van der Velden on the list with considerable interest.

Here at NCCU in Taiwan, we have 8 language departments -- English, French, German, Korean, Japanese, Spanish, Arabic, Turkish -- and we plan to build a learner corpus for each. Although this sounds like an ambitious scheme, it has support and funding from the central university administration.

The people studying these languages, here in Taiwan, are native speakers of Chinese. I'm aware of Chinese speaker learner corpora of some of the languages: English obviously, Spanish and Japanese (and German planned) at National Chengkung University. But I'm interested to know if any of our planned corpora will be firsts. It seems pretty unlikely that there exists a Chinese speaker LC of Turkish, for example. So if you are reading this, and you know of an existing Chinese speaker LC of one of our languages, perhaps you could let me know.

It's a longish-term project, and we're not too clear at the moment what sort of interlanguage annotation or correction we'll be doing. Right now, the important thing is to start collecting data. We could probably create our own interface to do this, but I wonder if there is a (free or shareware) product out there that we could use for LC building.

It would need to be pretty straightforward to use, because the language teachers collaborating will have no experience of corpora or corpus linguistics. Some of them will, indeed, have very little computer experience at all.

Ideally, we would collect the data (as homework assignments) directly from students. I'm wondering about the possibility of using Moodle for this, with either the Database or the Wiki module (there is a Corpus module, but it's not supported any more). The students would input their data, and everyone would be able to see it. In the Wiki, we could allow teachers to edit it, and a record of changes would be kept.

But I'm not sure how easy it would be to annotate a "corpus" in that format, or really analyse it in a conventional way. There would be no obvious way of generating a concordance, for example.

I really like the idea of a shared resource which can be built, updated, consulted and used by learners, all via the same interface.

Any thoughts anyone?


Simon Smith, PhD

Assistant Professor
Foreign Language Center
National Chengchi University

office: Research Building 416
phone: (0)2 2939 3091 x 88015
fax: +44 (0)871 243 1512
