[Corpora-List] Keywords Generator

True Friend true.friend2004 at gmail.com
Mon Feb 18 17:56:12 CET 2008


Hi Sir,

I tried your script, but it has some problems. The large size of the txt files was probably the reason: corpus A was about 1.9 million, and corpus B was almost the same size. It generated only "0"s for each word. Another possible cause is the large wordlist (1000 words). Here is a glimpse of the result:

votes 0 0

whereas 0 0

whereby 0 0

wherein 0 0

without 0 0

witness 0 0

witnesses 0 0

wound 0 0

writ 0 0

written 0 0

zila 0 0

zina 0 0

 court 0 0

When I tried it with a small wordlist, it generated counts for only one word (the last one, *court*). Please see the result:

judge 0 0

judgment 0 0

land 0 0

law 0 0

learned 0 0

order 0 0

ordinance 0 0

person 0 0

petition 0 0

petitioner 0 0

police 0 0

record 0 0

respondent 0 0

section 0 0

suit 0 0

trial 0 0

court 718 11128

The procedure I had in mind was: take each word, find its frequency in corpus A and then in corpus B, and print it. I could not understand the code (not a programmer yet :D); anyhow, there is something wrong. Could you spare some more time for it? Thanks a lot for your effort in writing this script.

Regards
M Shakir
Pakistan
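The procedure described above (take each word, look up its frequency in corpus A and in corpus B, print the result) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perl script from this thread: the function names tokenize, corpus_counts and wordlist_frequencies are made up here, and the normalisation (lower-casing, dropping punctuation and digits) is one reading of what the quoted message below describes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters, dropping punctuation and digits."""
    return re.findall(r"[^\W\d_]+", text.lower())


def corpus_counts(text):
    """Count every token of a corpus once, so that each later lookup is cheap."""
    return Counter(tokenize(text))


def wordlist_frequencies(words, text_a, text_b):
    """Return (word, freq_in_a, freq_in_b) for each word, in wordlist order."""
    counts_a = corpus_counts(text_a)
    counts_b = corpus_counts(text_b)
    rows = []
    for raw in words:
        # strip() removes surrounding spaces and any stray "\r" from the entry
        word = raw.strip().lower()
        if word:
            # A Counter returns 0 for words that never occur in the corpus
            rows.append((word, counts_a[word], counts_b[word]))
    return rows


if __name__ == "__main__":
    # Tiny made-up stand-ins for the two corpora and the wordlist
    general = "The court was closed. People walked past the court without a word."
    law = "The Court held that the petitioner may appeal. The court dismissed the petition."
    for word, a, b in wordlist_frequencies(["court", "petition", "without"], general, law):
        print(word, a, b)
```

Run as a script, this prints "court 2 2", "petition 0 1" and "without 1 0". Each corpus is counted once up front, so the size of the wordlist has little effect on running time; for real data the two strings would be read from the corpus files instead of being written inline.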

On Feb 18, 2008 5:34 PM, Alexander Schutz < goalscoringsuperstarhero at gmail.com> wrote:


> Hi Shakir,
>
> as part of a little exercise I wrote a tiny perl script performing what
> you asked.
> It takes as parameters the wordlist, the corpus_A and the corpus_B (each
> as text files)
> and produces as output the respective frequencies in each corpus:
> alesch@nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt How2DoResearchMIT.txt
> color 1 0
> colour 0 0
> furiously 0 0
> green 0 0
> idea 7 22
> sleep 0 0
>
> It does some normalisation on the corpora, like conversion to lower case
> and punctuation removal.
>
> Please find it as attachment, including the sample wordlist, to this
> email.
>
> Hth,
> Alex
>
>
>
> On Feb 18, 2008 10:53 AM, True Friend <true.friend2004 at gmail.com> wrote:
>
> > Hi Folks
> > I need a program/script (even a *nix one) that can provide the frequencies
> > of a wordlist in two corpora. I made this list by comparing two word
> > lists, one from general English (specifically of Pakistani origin) and one
> > from law English (also of Pakistani origin). I now want to present these
> > keywords with their frequencies in both corpora as proof that these words
> > are more frequent in law. The keywords were generated with AntConc.
> > Is there any script/tool that can generate a parallel list of
> > frequencies of each word in both corpora?
> > Regards
> > M Shakir Aziz
> > A Corpus Linguistics Student
> > Pakistan
> >
> > --
> > محمد شاکر عزیز
> > _______________________________________________
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> >
> >
>
>
> --
> Alexander Schutz,
> Digital Enterprise Research Institute,
> Ollscoil na hÉireann, Gaillimh
> Galway, Ireland

--
محمد شاکر عزیز


