[Corpora-List] Number of unique words in text for different languages

Justin Washtell lec3jrw
Sun Aug 15 03:20:30 CEST 2010


Or is morphology merely in the preparation of the fruit, the preferred slicing?

Justin Washtell
University of Leeds

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of David Wible [wible at stringnet.org]
Sent: 14 August 2010 09:55
To: fatima zuhra
Cc: corpora at uib.no; Emmanuel Prochasson
Subject: Re: [Corpora-List] Number of unique words in text for different languages

Is there an 'apples and oranges' dimension to this question when it involves comparing relatively isolating languages to more synthetic ones? Is picking the morphological analyzer or its settings going to bring us closer to an 'apples and apples' comparison when, to begin with, morphologically speaking we've got two different fruits in hand with the two languages being compared?

David Wible
Dean, College of Humanities
National Central University
Jhongli, Taiwan

On Saturday, August 14, 2010, fatima zuhra <fateeshah at yahoo.com> wrote:
>
> Perhaps page 5 of the paper, available from the following URL, contains useful information in this regard:
> http://gandalf.aksis.uib.no/non/lrec2000/pdf/262.pdf
>
> Regards.
>
> Fatima Tuz Zuhra
> Department of Computer Science,
> University of Peshawar. Peshawar. Pakistan.
> --- On Tue, 8/10/10, Emmanuel Prochasson <eprochasson at gmail.com> wrote:
>
>
> From: Emmanuel Prochasson <eprochasson at gmail.com>
> Subject: [Corpora-List] Number of unique words in text for different languages
> To: corpora at uib.no
> Date: Tuesday, August 10, 2010, 12:11 PM
>
> Dear all,
>
> I am working on a trilingual comparable corpus of French/English and
> Japanese. I am running a simple word count on each part of the corpus
> but found surprising results for Japanese.
>
> For each part, I count the total number of words and the number of
> /unique words/, that is, I count every word only once, even if it
> appears 1, 5 or 100 times. I POS-tagged each part of the corpus and
> kept only the lemmatized form of every word (to group the different
> inflections of one word). Furthermore, I focus only on nouns, keeping the
> "名詞:一般" tag for Japanese (noun:general) and all nouns (including proper
> nouns) in French/English. I use MeCab for Japanese and TreeTagger for
> French/English.
>
> Here are the results (Unique words/Total words).
> Japanese : 189,798 / 5,174,800
> English : 66,821 / 4,589,465
> French : 23,970 / 1,796,183
>
> What surprises me is that the number of unique nouns in Japanese is
> three times the number of unique nouns in English, even though the
> difference in the total number of words in the two languages is not that
> large (the French/English ratio is more consistent, for example).
>
> As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I
> checked using Google Translate and it seems to make sense, but my lack of
> skill in Japanese prevents me from investigating further).
>
> Is this a normal result?
>
> Regards,
>
> --
> Emmanuel Prochasson
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>

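For anyone wanting to reproduce the kind of count described in the original message, the following is a minimal sketch in Python. It assumes the tagger output has already been converted to (lemma, tag) pairs; that input shape, the function name count_nouns and the "NOUN"/"VERB" tag names are hypothetical and are not the native output of MeCab or TreeTagger. The second half simply divides the unique counts by the totals, using the figures exactly as reported above.

```python
from collections import Counter


def count_nouns(tagged, noun_tags):
    """Return (unique_lemmas, total_tokens) for tokens whose tag is in noun_tags.

    `tagged` is an iterable of (lemma, tag) pairs. This input shape is an
    assumption for illustration, not the raw output of any tagger.
    """
    counts = Counter(lemma for lemma, tag in tagged if tag in noun_tags)
    return len(counts), sum(counts.values())


# Toy input with made-up tag names, only to exercise the function.
sample = [("corpus", "NOUN"), ("be", "VERB"), ("corpus", "NOUN"), ("word", "NOUN")]
unique, total = count_nouns(sample, {"NOUN"})

# Figures exactly as reported in the message: (unique, total).
reported = {
    "Japanese": (189_798, 5_174_800),
    "English": (66_821, 4_589_465),
    "French": (23_970, 1_796_183),
}

# Unique-to-total ratio for each language.
ratios = {lang: u / t for lang, (u, t) in reported.items()}
for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.2%}")
```

On the reported figures this gives roughly 3.7% for Japanese against about 1.5% for English and 1.3% for French, which is the gap the original question is about. The sketch says nothing about why the gap arises; it only makes the comparison explicit.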