[Corpora-List] Serbian resources wanted

Vlado Keselj vlado at cs.dal.ca
Sat Feb 23 01:09:03 CET 2013


Hi Martin,

Resources prepared in our paper:

Vlado Keselj and Danko Sipka. A Suffix Subsumption-based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources. In INFOTHECA, Journal of Informatics and Librarianship, No 1-2, Volume IX, May 2008.

are available at: http://web.cs.dal.ca/~vlado/nlp/2007-sr/

Among other resource files, they include lists of lemmatized words:

list-l: 47489 lemmas (0.47 KB)
list-w: 675140 word-forms (7.3 MB)
list-w-l: 696454 word-form/lemma pairs (14.6 MB)
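
As a rough illustration of how the word-form/lemma pairs could be used to turn a token list into a lemma frequency list, here is a minimal Python sketch. It assumes each line of list-w-l holds one word-form and one lemma separated by whitespace; that layout is an assumption, so please check the actual file first. The function names (load_pairs, lemma_counts) and the sample pairs are made up for the example and are not taken from the resource files. Since there are more pairs than word-forms, some forms evidently map to more than one lemma; the sketch simply splits the count evenly among the candidates, which is only one of several possible choices.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set


def load_pairs(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each word-form to the set of lemmas it may belong to.

    Assumes one "word-form lemma" pair per line, whitespace-separated
    (an assumption about the file layout, not a documented format).
    """
    form_to_lemmas: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        form, lemma = parts
        form_to_lemmas[form.lower()].add(lemma.lower())
    return form_to_lemmas


def lemma_counts(tokens: Iterable[str],
                 form_to_lemmas: Dict[str, Set[str]]) -> Counter:
    """Aggregate token counts into lemma counts.

    Ambiguous word-forms split their count evenly among candidate lemmas;
    unknown word-forms are counted under their own lowercased form.
    """
    counts: Counter = Counter()
    for tok in tokens:
        key = tok.lower()
        lemmas = form_to_lemmas.get(key)
        if not lemmas:
            counts[key] += 1
            continue
        share = 1.0 / len(lemmas)
        for lemma in lemmas:
            counts[lemma] += share
    return counts


if __name__ == "__main__":
    # Illustrative pairs only; real data would be read from the list-w-l file.
    sample_pairs = ["kuće kuća", "kući kuća", "sam biti", "sam sam"]
    mapping = load_pairs(sample_pairs)
    for lemma, n in lemma_counts(["Kuće", "kući", "sam"], mapping).most_common():
        print(lemma, n)
```

With the real file, load_pairs could be given an open file handle, e.g. open(path, encoding="utf-8"), provided the encoding is checked first. The resulting lemma counts for the student's corpus and for a reference corpus can then be compared with whatever keyness measure is preferred.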

Regards, Vlado

On Fri, 22 Feb 2013, Adam Kilgarriff wrote:


> Hi Martin,
>
> we have a Serbian corpus in the Sketch Engine so all she needs to do is
> upload her corpus and then run 'keywords' to compare hers with the
> reference.
>
> The one that is currently available is not lemmatised, so comparisons there
> would be wordform-based. However, we are lemmatising and POS-tagging a newer,
> bigger dataset (courtesy of Nikola Ljubešić) as we speak, so we can make that
> available too; then she can get key lemmas. If you or she ask, we can make
> a big sample of the lemmatised material available at a day or two's notice.
>
> Best
>
> Adam
>
>
> On 22 February 2013 15:39, Martin Wynne <martin.wynne at it.ox.ac.uk> wrote:
>
> > I would like to pose a question on behalf of a student who would like to
> > generate keywords by comparing her corpus of contemporary online personal
> > ads in Serbian with a reference corpus.
> >
> > Does anyone know of any freely available wordlists for the modern Serbian
> > language? Ideally, we'd like a lemma frequency list generated from a
> > general reference corpus, although lists from various other text types
> > could be useful. We'd be interested if there is a corpus available to use
> > as well.
> >
> > Many thanks for any help.
> >
> >
> > --
> > Martin Wynne
> > IT Services, University of Oxford
> > Oxford e-Research Centre
> > Faculty of Linguistics, Philology and Phonetics
> >
> > martin.wynne at it.ox.ac.uk
> >
> >
>
>
>
> --
> ========================================
> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
> adam at lexmasterclass.com
> Director Lexical Computing Ltd <http://www.sketchengine.co.uk/>
>
> Visiting Research Fellow University of Leeds <http://leeds.ac.uk>
>
> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
>
> *DANTE: a lexical database for English* <http://www.webdante.com>
> ========================================
>
