I'm using a parser (link-grammar) which lets me attach a pattern (a "disjunct") to every word of a sentence, describing how that word was used in the sentence. One can think of the disjunct as being a very fine-grained part of speech: for example, it distinguishes not only transitive from intransitive verbs, but also transitive verbs from ditransitive ones, verbs that took particles, verbs with singular vs. plural objects, etc. The disjunct precisely captures the syntactic usage of a given word in a given sentence.
The attached graph shows rank versus frequency of usage for disjuncts, taken from a corpus of about 1M sentences from Wikipedia articles. There is a nice long tail, showing a Zipfian power-law distribution, with exponent 1.5. There is also a knee at the top of the ranking: the most frequent disjuncts are less frequent than they "should be" for a pure Zipfian distribution.
The questions are then: 1) Why a power law of 1.5? 2) Why is there a knee? 3) What about other languages?
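For anyone who wants to try the same measurement on their own counts, here is a minimal sketch of one common way to estimate such an exponent: an ordinary least-squares fit of log(frequency) against log(rank). This is not necessarily how the graph was produced. The function name zipf_exponent and the min_rank parameter are made up for illustration, and I am assuming "exponent" means s in frequency ~ rank^(-s). The min_rank cutoff lets the fit skip the knee among the most frequent items.

```python
import math

def zipf_exponent(counts, min_rank=1):
    """Estimate s in freq ~ rank**(-s) by least squares on log-log axes.

    counts   -- iterable of observed frequencies, in any order
    min_rank -- ranks below this are left out of the fit (hypothetical
                knob for skipping the knee among the top-ranked items)
    """
    # Rank 1 is the most frequent item; zero counts cannot be log-scaled.
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    points = [(math.log(rank), math.log(f))
              for rank, f in enumerate(freqs, start=1)
              if rank >= min_rank]
    if len(points) < 2:
        raise ValueError("need at least two ranked points to fit a slope")

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    # The fitted slope is negative for a decaying power law; report s > 0.
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic data (not the Wikipedia counts): an exact power law.
    synthetic = [1e6 * r ** -1.5 for r in range(1, 1001)]
    print(zipf_exponent(synthetic))
    print(zipf_exponent(synthetic, min_rank=20))
```

A least-squares fit on log-log axes is easy to write down but is known to be a rough estimator for power laws, so the number it gives should be treated as indicative only. Comparing the fit with and without a min_rank cutoff is one quick way to see how much the knee pulls the estimate.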
I blogged this in slightly more detail at:
--linas

[Attachment: disjunct-true-rank.png (image/png, 4704 bytes) <http://www.uib.no/mailman/public/corpora/attachments/20090707/9952c250/attachment.png>]