[Corpora-List] Keywords Generator

Ben Allison ben at dcs.shef.ac.uk
Mon Feb 18 12:57:29 CET 2008


If you're happy with *nix, this can be done in a couple of lines: first produce a frequency-counted word list for each of the two corpora, then grep both lists for the word you're interested in. Depending on how reliable you want word detection to be, the first part might be a little more complicated (especially if you have odd formatting/encoding issues), but the second is just a single line.

A very simple script for determining the count of a word SOME_WORD in the corpus myfile.txt (assuming the script words.pl I include below) would be:

./words.pl < myfile.txt | sort | uniq -c | grep 'SOME_WORD'

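(Replace myfile.txt and SOME_WORD with appropriate strings.) There are no doubt better ways. Note that grep also matches any longer word that contains SOME_WORD; grep -w should restrict the match to whole words. Also, you may wish to consider normalised frequency (for example, occurrences per million words), since raw counts do not compare well if the corpora are of different lengths.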
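As a rough illustration of normalised frequency, the sketch below reports the raw count of a word together with its rate per million tokens. It assumes the input is one lower-cased word per line, which is what words.pl is meant to print; the function name rel_freq and the tab-separated output format are my own choices for illustration, not part of any existing tool.

```shell
#!/bin/sh
# Sketch only: raw count and per-million rate of one word in a token stream.
# Assumes one token per line on standard input.
# rel_freq is a hypothetical helper name.
rel_freq() {
    # $1 = the word to count
    awk -v w="$1" '
        { total++ }            # every input line is one token
        $0 == w { hits++ }     # whole-line comparison, so no substring matches
        END {
            # prints: word <TAB> raw count <TAB> occurrences per million tokens
            if (total > 0)
                printf "%s\t%d\t%.1f\n", w, hits, hits * 1000000 / total
        }'
}
```

With the function defined in the current shell, something like ./words.pl < myfile.txt | rel_freq SOME_WORD should give a figure that can be compared across corpora of different sizes.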

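#!/usr/bin/perl
# words.pl: print every word found on standard input, one per line.
while (<>) {
    # Lower-casing is an assumption; the pattern matches only a-z and '.
    $_ = lc;
    while (/\b([a-z']+)\b/g) {
        print "$1\n";
    }
}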
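For the parallel list asked about in the original message, one possible approach is to tokenise each corpus once and then count all the keywords in a single pass. The sketch below assumes three plain-text files with one lower-cased word per line: the keyword list and the token output for each corpus. The function name parallel_counts and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch only: for each keyword, print its count in corpus A and in corpus B.
# Usage: parallel_counts KEYWORDS TOKENS_A TOKENS_B
# Each file is assumed to hold one lower-cased word per line.
# parallel_counts is a hypothetical helper name.
parallel_counts() {
    awk '
        FILENAME == ARGV[1] { keys[++n] = $0; next }   # keywords, in file order
        FILENAME == ARGV[2] { a[$0]++; next }          # token counts, corpus A
        { b[$0]++ }                                    # token counts, corpus B
        END {
            # prints: keyword <TAB> count in A <TAB> count in B
            for (i = 1; i <= n; i++)
                printf "%s\t%d\t%d\n", keys[i], a[keys[i]], b[keys[i]]
        }
    ' "$1" "$2" "$3"
}
```

A possible run, with illustrative file names: ./words.pl < general.txt > general.tok, then ./words.pl < law.txt > law.tok, then parallel_counts keywords.txt general.tok law.tok. The three arguments need to be distinct files, since the script tells them apart by name. The counts are raw, so the earlier point about normalising for corpus size still applies.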

True Friend wrote:
> Hi Folks
> I need a program/script (even for *nix) that can provide the frequencies
> of a word list in two corpora. Actually, I made this list by
> comparing two word lists, one from general English (specifically of
> Pakistani origin) and one from law English (also of Pakistani origin). I now
> want to present these keywords with their frequencies in both corpora
> as proof that these words are more frequent in law. The keywords were
> generated by AntConc.
> Is there any script/tool that can generate a parallel list of
> frequencies of each word in both corpora?
> Regards
> M Shakir Aziz
> A Corpus Linguistics Student
> Pakistan
> --
> محمد شاکر عزیز
