[Corpora-List] Cleaning text to take word frequency

Alexandre Rafalovitch arafalov at gmail.com
Sun Jun 1 15:30:03 CEST 2008


The way I would approach this is to find the words whose counts differ between the two versions of the result, including words that appear in one version but not the other. Then I would look for those words in the text and see what context they occur in.
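
A minimal sketch of that comparison, written in Python for brevity (the same idea carries over to C#). The function names and the sample counts in the usage note are hypothetical, made up for illustration, and it assumes each program's output has already been loaded into a word-to-count mapping:

```python
from collections import Counter


def count_discrepancies(counts_a, counts_b):
    """Return {word: (count_a, count_b)} for every word whose counts differ.

    A word missing from one result is treated as having count 0 there.
    """
    diffs = {}
    for word in set(counts_a) | set(counts_b):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        if a != b:
            diffs[word] = (a, b)
    return diffs


def contexts(text, word, width=30):
    """Yield a snippet of text around each occurrence of word."""
    start = text.find(word)
    while start != -1:
        yield text[max(0, start - width): start + len(word) + width]
        start = text.find(word, start + 1)
```

For example, if one result has `Counter({"the": 3, "don't": 1})` and the other has `Counter({"the": 3, "don": 1, "t": 1})`, the discrepancies are "don't", "don" and "t", which points at the apostrophe handling; `contexts(text, "don't")` then shows where in the text it happens.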

What I suspect you will find is that your partial reimplementation of Perl's [:punct:] class is causing the problems. I would either do a complete reimplementation of it (see: http://en.wikipedia.org/wiki/Regular_expression ) or look into C#'s regular expressions, which I am sure will contain the same definition of the [:punct:] class.

Finally, if you are working with languages other than English, you most certainly should look into regular expression libraries. They take into account Unicode's rules as well, something you really don't want to have to duplicate in your own code.
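
To illustrate why a hand-rolled punctuation list tends to fall short, here is a rough Python sketch that leans on the Unicode character database instead of a hard-coded list. It is only an approximation of what a regular expression library does: it treats as punctuation the 32 ASCII characters of the POSIX punct class plus anything in a Unicode "P" (punctuation) general category. The names `is_punct` and `word_frequencies` are hypothetical, and lowercasing and whitespace splitting are assumptions about the desired tokenization, not something taken from the original scripts:

```python
import string
import unicodedata
from collections import Counter

# string.punctuation holds the 32 ASCII punctuation characters.
ASCII_PUNCT = set(string.punctuation)


def is_punct(ch):
    """True for ASCII punctuation or any Unicode punctuation category (P*)."""
    return ch in ASCII_PUNCT or unicodedata.category(ch).startswith("P")


def word_frequencies(text):
    """Replace punctuation with spaces, lowercase, and count the tokens."""
    cleaned = "".join(" " if is_punct(ch) else ch for ch in text)
    return Counter(cleaned.lower().split())
```

With this, guillemets, inverted exclamation marks and similar non-English punctuation are stripped along with the ASCII characters, so "«Hola», dijo. ¡Hola!" counts "hola" twice and "dijo" once.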

Regards,

Alex.

-- Personal blog: http://blog.outerthoughts.com/ Research group: http://www.clt.mq.edu.au/Research/

On Sun, Jun 1, 2008 at 7:07 AM, True Friend <true.friend2004 at gmail.com> wrote:
>
> Hi,
> I am a corpus linguistics student and am learning C# for this purpose as well.
> I've created a simple application to find the frequency of a given word in
> two files. Actually, this simple application is a practice version in C# of a
> Perl script that a respected subscriber of this list (Alexander Schutz) wrote
> for me at my request on this list. I needed it then; now I am trying to
> program myself, so I tried to implement that idea in C#.


