[Corpora-List] Most frequent 5K words in Icelandic?

Anton Karl Ingason ingason at ling.upenn.edu
Mon Nov 19 17:04:30 CET 2012


Hi Kim,

You can use the IcePaHC corpus to extract these frequencies. Although it is a historical corpus, it spans the period 12th-21st century, so you could use the texts from, say, the 19th-21st centuries, which represent the modern language well. IcePaHC is a free resource.

Note that the corpus is lemmatized, and in addition to the treebank format, the main download includes formats that are more convenient for your purpose.

http://www.linguist.is/icelandic_treebank/Download
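
As a rough illustration of the counting step, here is a minimal Python sketch. It assumes you have already exported the texts you want (say, the 19th-21st century ones) to a plain-text file with one token per line in the form word<TAB>lemma. That layout, the file name in the comment, and the function name top_lemmas are assumptions for illustration only, not the actual IcePaHC download format, so adapt the parsing to whichever format you pick.

```python
from collections import Counter

def top_lemmas(lines, n=5000):
    """Return the n most frequent lemmas from 'word<TAB>lemma' lines.

    The tab-separated layout is a hypothetical export format, not
    necessarily what the IcePaHC download provides.
    """
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[1]:
            continue  # skip blank or malformed lines
        counts[parts[1].lower()] += 1  # count the lemma, case-folded
    return counts.most_common(n)

# Tiny in-memory sample standing in for a real file.
sample = ["Hann\thann", "er\tvera", "var\tvera", "hér\thér", ""]
for lemma, freq in top_lemmas(sample, n=3):
    print(lemma, freq)

# With a real export (hypothetical file name):
# with open("icepahc_modern.tsv", encoding="utf-8") as f:
#     top_5000 = top_lemmas(f)
```

Counting lemmas rather than word forms matters for Icelandic, since a single lemma can surface in many inflected forms; count parts[0] instead if you want surface forms.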

Unfortunately, it does not have English glosses, and I don't have an ideal solution for that, but you might get something useful by looking words up in this list: http://linguist.is/dictionary (it uses a different tagset and is quite limited, but it is also a free resource).

The two tagsets you would be interested in are described on these pages:
http://www.linguist.is/icelandic_treebank/Tagset
http://linguist.is/icelandic_treebank/IFD_Tagset

There is an LREC paper on IcePaHC: http://www.lrec-conf.org/proceedings/lrec2012/summaries/440.html

If you have any questions regarding IcePaHC, feel free to email me or any other member of the IcePaHC project.

Best, Anton

On Mon, Nov 19, 2012 at 9:05 AM, Thommy Mayer <thommy.mayer at gmail.com> wrote:


> Hi Kim,
>
> You could also check the "Frequency Dictionary Icelandic" from the
> Leipzig Wortschatz group or contact Uwe Quasthoff directly for the
> relevant data (quasthoff at informatik.uni-leipzig.de ).
>
> Quasthoff, Uwe, Sabine Fiedler, Erla Hallsteinsdóttir (ed.). 2012.
> Frequency Dictionary Icelandic (Íslensk tíðniorðabók). Band 3 der
> Reihe Frequency Dictionaries. Universitätsverlag, 109 S. (+CD-ROM).
>
> Regards,
> Thomas
>
> ---------------------------------------------------------------------------
> Thomas Mayer
> Research Unit "Quantitative Language Comparison"
> Forschungszentrum Deutscher Sprachatlas
> Philipps-Universität Marburg
> Hermann-Jacobsohn-Weg 3
> 35032 Marburg
>
> Current address:
> Geschwister Scholl Platz 1
> 80539 München, Germany
> Office: Schellingstraße 9, Raum 301
> Tel: +49 89 2180 6144
> ---------------------------------------------------------------------------
>
>
> 2012/11/19 Kim Witten <kimwitten at gmail.com>:
> > Hi Corpora Subscribers,
> > I'm wondering if somebody might be able to point me in the direction to
> > find a simple list of the 5,000 most frequent words in Icelandic, from any
> > (relatively current, non-historical) Icelandic corpus? With English gloss
> > would be even better, but it's not necessary. Thanks!
> > -Kim
> > ---
> > Kim Witten, PhD candidate
> > Language & Linguistic Science
> > University of York, UK
> > kaw522 at york.ac.uk
> > www.MePhiD.com
> >
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
>

--
www.linguist.is
tel: 215-350-7215


