I have noticed that there seems to be a problem with the treatment of "won't" in the Google ngram corpus. We would expect it to occur about 100 million times, but it seems to have disappeared or to be tokenized in a non-standard way. We would expect it to appear in the Penn style as "wo" + "n't".
As unigrams we get
won't 37251
wont 3677346
wo 1226869
In the bigrams we get things like
I 'm 188587483
but I can find nothing that corresponds to "won't".
Has anyone else noticed this?
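In case anyone wants to check their own copy, here is a rough sketch in Python of one way to look for alternative splits. It keeps any n-gram whose tokens concatenate to "won't", so it would catch "wo n't", "won 't" and similar. It assumes each line of an n-gram file is the space-separated tokens, a tab, then the count; that layout is an assumption and should be checked against the actual files. The function name find_splits and the counts in the sample are made up for illustration and are not corpus figures.

```python
from collections import Counter
from typing import Iterable


def find_splits(lines: Iterable[str], target: str = "won't") -> Counter:
    """Sum the counts of n-grams whose tokens concatenate to `target`.

    Assumed line format (not verified against the corpus):
        token1 token2 ...<TAB>count
    """
    hits: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip lines that do not match the assumed format
        ngram, count = line.rsplit("\t", 1)
        tokens = ngram.split(" ")
        # Case-insensitive match on the tokens joined without spaces.
        if "".join(tokens).lower() == target:
            hits[ngram] += int(count)
    return hits


# Illustrative input only: the second and third counts are invented.
sample = [
    "I 'm\t188587483",
    "wo n't\t12",
    "won 't\t3",
]
print(find_splits(sample))
```

On the real data one would pass an open file handle for each bigram file instead of the sample list.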
Regards,
Alex
--
Alexander Clark
alexc at cs.rhul.ac.uk
http://www.cs.rhul.ac.uk/home/alexc/
Lecturer, Department of Computer Science
Royal Holloway, University of London
Egham, Surrey TW20 0EX
Direct 01784 443430
Department 01784 434455
Fax 01784 439786