[Corpora-List] Fwd: Number of unique words in text for different languages

coffey at cli.unipi.it coffey
Thu Aug 12 17:45:04 CEST 2010


Quoting Jim Fidelholtz <fidelholtz at gmail.com>:


> Hi all,
>
> As a disclaimer, I have not worked with any of the tokenizers. For the type
> of results originally reported, however, I do have a suggestion for a
> possible partial explanation, based on some experience with Spanish. There
> is a real stylistic rule in Spanish which makes speakers and especially
> writers avoid repeating the same 'content word' within the same or
> contiguous sentences or clauses, using instead a synonym or paraphrase.

... and the same is true for Italian.

Steve Coffey.



