Web corpora: English French German Italian Japanese Spanish
Non-web corpora: Chinese English Slovene Portuguese
Web corpora are, for general-language research purposes, usually better than newspaper corpora (the usual alternative), as they cover a much wider range of text types (see various studies by Serge Sharoff).
For Italian we have a 2-billion-word web corpus prepared by Marco Baroni; see Baroni and Kilgarriff (2006), "Large linguistically-processed Web corpora for multiple languages" <http://kilgarriff.co.uk/Publications/2006-BaroniKilg-EACL-DeWAC.pdf>, Proc. EACL, Trento, Italy.
Adam Kilgarriff
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Emiliano Guevara
Sent: 14 November 2007 23:08
To: CORPORA
Subject: Re: [Corpora-List] General Italian wordlist
Dear Jane,
unfortunately, there are still no freely available or freely manipulable "general" corpora of Italian comparable to the BNC (I suppose what you mean is a reference corpus: balanced according to genre and medium, large enough to be representative of the whole language, etc.).
I guess the best you can get is wordlists generated either from web corpora or from large unbalanced corpora such as the "La Repubblica" corpus (see http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica).
The good news is: you can get all of this right at Bologna University!
I'll be happy to help you with any of these alternatives and, if needed, to find a better way to do the keyword search than what WordSmith Tools (WSTools) has to offer (once you start working with several million words, WSTools just chokes...).
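For what it's worth, a keyword list can also be computed outside WSTools with a few lines of script. What follows is only a minimal sketch, assuming you already have word frequencies for the novels and for a reference wordlist. It uses log-likelihood, one common keyness statistic. The function names (log_likelihood, keywords), the min_freq threshold and the toy counts are all made up for illustration, not taken from any existing tool.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) keyness score.

    a: frequency of the word in the study corpus
    b: frequency of the word in the reference corpus
    c: total tokens in the study corpus
    d: total tokens in the reference corpus
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(study, reference, min_freq=2):
    """Return (word, score) pairs for words over-represented in `study`,
    sorted by descending log-likelihood. min_freq is an arbitrary cutoff."""
    c = sum(study.values())
    d = sum(reference.values())
    scored = []
    for word, a in study.items():
        if a < min_freq:
            continue
        b = reference.get(word, 0)
        # Keep only positive keywords: relatively more frequent in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy counts, invented for illustration only.
novels = Counter({"mare": 50, "il": 100})
reference = Counter({"mare": 5, "il": 1200, "casa": 295})
ranked = keywords(novels, reference)
```

With real data, the two Counter objects would be filled from the wordlists themselves (one word and frequency per entry), and since this only needs the frequency tables in memory, not the running text, several million words should not be a problem.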
Cheers,
Emiliano
On 14 Nov 2007, at 16:12, jane..johnson@@libero..it wrote:
> Similar to the BNC_World.lst for use with the Keyword tool of the
> WordSmith suite, I am looking for a wordlist generated from a
> general corpus of contemporary Italian to create a Keyword list
> for a selection of Italian novels. Can anyone point me in the right
> direction? thanks
> Jane Johnson
> University of Bologna
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dip. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5
(40124) Bologna, Italia
Homepage: http://morbo.lingue.unibo.it/
E-mail: emiliano.guevara at unibo.it
        emiguevara at gmail.com
****************************************