David Wible
Dean, College of Humanities
National Central University
Jhongli, Taiwan
On Saturday, August 14, 2010, fatima zuhra <fateeshah at yahoo.com> wrote:
>
> Perhaps page 5 of the paper, available from the following URL, contains useful information in this regard:
> http://gandalf.aksis.uib.no/non/lrec2000/pdf/262.pdf
>
> Regards.
>
> Fatima Tuz Zuhra
> Department of Computer Science,
> University of Peshawar. Peshawar. Pakistan.
> --- On Tue, 8/10/10, Emmanuel Prochasson <eprochasson at gmail.com> wrote:
>
>
> From: Emmanuel Prochasson <eprochasson at gmail.com>
> Subject: [Corpora-List] Number of unique words in text for different languages
> To: corpora at uib.no
> Date: Tuesday, August 10, 2010, 12:11 PM
>
> Dear all,
>
> I am working on a trilingual comparable corpus of French/English and
> Japanese. I am running a simple word count on each part of the corpus
> but found surprising results for Japanese.
>
> For each part, I count the total number of words and the number of
> /unique words/; that is, I count every word only once, even if it
> appears 1, 5 or 100 times. I POS-tagged each part of the corpus and
> kept only the lemmatized form of every word (to group the different
> inflected forms of one word). Furthermore, I focus only on nouns, keeping the
> "名詞:一般" tag for Japanese (noun:general) and all nouns (including proper
> nouns) in French/English. I use MeCab for Japanese and TreeTagger for
> French/English.
>
> Here are the results (Unique words/Total words).
> Japanese : 189,798 / 5,174,800
> English : 66,821 / 4,589,465
> French : 23,970 / 1,796,183
>
> What surprises me is that the number of unique nouns in Japanese is
> three times the number of unique nouns in English, even though the
> difference in the total number of words between the two languages is not
> that large (the French/English ratio is more consistent, for example).
>
> As far as I can tell, the tokenization/POS-tagging looks /ok/ (i.e., I
> checked using Google Translate and it seems to make sense, but my lack of
> skill in Japanese prevents me from investigating further).
>
> Is this a normal result?
>
> Regards,
>
> --
> Emmanuel Prochasson
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
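
For reference, the figures quoted above can be compared as type/token ratios. What follows is only a minimal sketch in Python: the numbers are copied from the quoted message, while the names `counts` and `type_token_ratio` are made up for illustration. It also assumes that the second figure in each pair is the token total over which the unique nouns were counted, which the message does not state explicitly.

```python
# Figures copied from the quoted message: (unique nouns, total words).
# Assumption: "total words" is the token count the unique nouns were drawn from.
counts = {
    "Japanese": (189_798, 5_174_800),
    "English": (66_821, 4_589_465),
    "French": (23_970, 1_796_183),
}


def type_token_ratio(types: int, tokens: int) -> float:
    """Return the number of distinct forms divided by the number of running words."""
    return types / tokens


# Type/token ratio per language.
ratios = {lang: type_token_ratio(t, n) for lang, (t, n) in counts.items()}

# How many times more unique nouns Japanese has than English.
ja_en_unique = counts["Japanese"][0] / counts["English"][0]

for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.4f}")
print(f"Japanese/English unique nouns: {ja_en_unique:.2f}")
```

Raw type/token ratios depend on corpus size, so the French figure, computed over a much smaller sample, is not directly comparable with the other two without sampling equal-sized portions first.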