[Corpora-List] Keywords Generator

True Friend true.friend2004 at gmail.com
Fri Feb 22 08:08:23 CET 2008


Thanks, Mr. Schutz. It is working fine now. The only thing I had to work around was the automatic generation of the wordlist: the script did not work with the wordlist generated by AntConc, so I typed it in manually, and now it works fine. I can see a few words with 0 frequency; I'll correct them manually.

Regards
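For readers following the thread, the procedure discussed here (take each word in the wordlist, count it in every corpus after lower-casing and punctuation removal, print the counts side by side) can be sketched in Python as below. This is not the Perl script from the thread, whose source is not shown; the function names `clean_wordlist`, `normalise` and `frequency_table` are hypothetical. The thread does not say why the AntConc-generated wordlist failed, so stripping stray whitespace, carriage returns and a leading byte-order mark is only a guess at a common cause, and the sketch assumes a wordlist with one word per line.

```python
import string
from collections import Counter

# Translation table that deletes ASCII punctuation (assumption: the thread only
# says "punctuation removal", so deleting rather than replacing is a guess).
_PUNCT = str.maketrans("", "", string.punctuation)


def clean_wordlist(raw):
    """Return the wordlist entries in order, lower-cased, without blanks or repeats."""
    words = []
    # Drop a leading byte-order mark; splitlines() also handles \r\n line endings.
    for line in raw.lstrip("\ufeff").splitlines():
        word = line.strip().lower()
        if word and word not in words:
            words.append(word)
    return words


def normalise(text):
    """Lower-case the text, delete punctuation and split on whitespace."""
    return text.lower().translate(_PUNCT).split()


def frequency_table(words, corpora):
    """Return (word, [count in corpus 0, count in corpus 1, ...]) for each word."""
    counts = [Counter(normalise(text)) for text in corpora]
    return [(word, [c[word] for c in counts]) for word in words]


if __name__ == "__main__":
    # Tiny made-up inputs, standing in for the wordlist and corpus files.
    raw_wordlist = "\ufeffjudge\r\n court \r\nlaw\r\n"
    corpus_a = "The Court heard the judge. The court adjourned."
    corpus_b = "Law and order; the law is the law."
    table = frequency_table(clean_wordlist(raw_wordlist), [corpus_a, corpus_b])
    for word, freqs in table:
        print(word, *freqs)
```

With real data, the strings would be read from the wordlist and corpus files instead. A word that still shows 0 in every corpus after this cleaning is worth checking by hand for spelling variants, as with color and colour in the sample output quoted below.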

On Tue, Feb 19, 2008 at 8:56 PM, Alexander Schutz <goalscoringsuperstarhero at gmail.com> wrote:


> Hi,
>
> I took the time to beautify and document the Perl code a little, and I hope
> it is now a bit clearer what it does. You can specify any number of corpus
> files on the command line; however, the first file you specify must always
> be the wordlist file.
> Running the script on my machine yields the following:
>
> alesch at nbgal141:~/tmp$ perl wordlist_corpus_freq.pl wordlist.txt Charles_Dickens_-_David_Copperfield.txt James_Joyce_-_Ulysses_-_Text.txt Charles_Dickens_-_Oliver_Twist.txt
> reading wordlist : wordlist.txt
> processing corpus 0 : Charles_Dickens_-_David_Copperfield.txt
> processing corpus 1 : James_Joyce_-_Ulysses_-_Text.txt
> processing corpus 2 : Charles_Dickens_-_Oliver_Twist.txt
> color 0 0 0
> colour 25 23 5
> furious 2 5 7
> furiously 0 2 7
> green 39 55 24
> idea 93 55 14
> sleep 72 43 37
>
>
> If you have questions, don't hesitate to get back to me.
>
> Hth,
> Alex
>
>
> On Feb 18, 2008 4:56 PM, True Friend <true.friend2004 at gmail.com> wrote:
> > Hi Sir
> > I tried your script, but it has some problems. The large size of the
> > text files was probably the reason: Corpus A was about 1.9 million and
> > Corpus B was almost the same size as A. It generated only "0"s for each
> > word. Another possible cause was the large size of the wordlist (1000
> > words). A glimpse of the result:
> > votes 0 0
> > whereas 0 0
> > whereby 0 0
> > wherein 0 0
> > without 0 0
> > witness 0 0
> > witnesses 0 0
> > wound 0 0
> > writ 0 0
> > written 0 0
> > zila 0 0
> > zina 0 0
> >  court 0 0
> > When I tried it with a small wordlist, it produced counts for only one
> > word (the last one, court). Please see the result:
> > judge 0 0
> > judgment 0 0
> > land 0 0
> > law 0 0
> > learned 0 0
> > order 0 0
> > ordinance 0 0
> > person 0 0
> > petition 0 0
> > petitioner 0 0
> > police 0 0
> > record 0 0
> > respondent 0 0
> > section 0 0
> > suit 0 0
> > trial 0 0
> > court 718 11128
> > The procedure I had in mind was: take a word, find its frequency in
> > Corpus A and then in Corpus B, and then print it. I could not understand
> > the code (not a programmer yet :D); anyhow, there is something wrong.
> > Could you spare some more time for it?
> > Thanks a lot for your effort in writing this script.
> > Regards
> > M Shakir
> > Pakistan
> >
> >
> >
> > On Feb 18, 2008 5:34 PM, Alexander Schutz <goalscoringsuperstarhero at gmail.com> wrote:
> >
> >
> >
> >
> > > Hi Shakir,
> > >
> > > as part of a little exercise I wrote a tiny Perl script that does
> > > what you asked.
> > > It takes as parameters the wordlist, corpus_A and corpus_B (each as a
> > > text file) and produces as output the respective frequencies in each
> > > corpus:
> > > alesch at nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt How2DoResearchMIT.txt
> > > color 1 0
> > > colour 0 0
> > > furiously 0 0
> > > green 0 0
> > > idea 7 22
> > > sleep 0 0
> > >
> > > It does some normalisation on the corpora, such as conversion to lower
> > > case and punctuation removal.
> > >
> > > Please find it attached to this email, together with the sample
> > > wordlist.
> > >
> > > Hth,
> > > Alex
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Feb 18, 2008 10:53 AM, True Friend <true.friend2004 at gmail.com> wrote:
> > >
> > > >
> > > >
> > > >
> > > > Hi Folks
> > > > I need a program/script (even for *nix) that can give the
> > > > frequencies of the words in a wordlist in two corpora. I made this
> > > > list by comparing two word lists, one from general English
> > > > (specifically of Pakistani origin) and one from law English (also of
> > > > Pakistani origin). I now want to present these keywords with their
> > > > frequencies in both corpora as proof that these words are more
> > > > frequent in law. The keywords were generated by AntConc.
> > > > Is there any script/tool that can generate a parallel list of the
> > > > frequencies of each word in both corpora?
> > > > Regards
> > > > M Shakir Aziz
> > > > A Corpus Linguistics Student
> > > > Pakistan
> > > >
> > > > --
> > > > محمد شاکر عزیز
> > > >
> > > > _______________________________________________
> > > > Corpora mailing list
> > > > Corpora at uib.no
> > > > http://mailman.uib.no/listinfo/corpora
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Alexander Schutz,
> > > Digital Enterprise Research Institute,
> > > Ollscoil na hÉireann, Gaillimh
> > > Galway, Ireland
> >
> >
> >
> > --
> > محمد شاکر عزیز
>
>
>
> --
> Alexander Schutz,
> Digital Enterprise Research Institute,
> Ollscoil na hÉireann, Gaillimh
> Galway, Ireland
>

--
محمد شاکر عزیز