[Corpora-List] Keywords Generator

Alexander Schutz goalscoringsuperstarhero at gmail.com
Mon Feb 18 19:35:02 CET 2008


Shakir,

I am pretty sure this is not the right forum for supporting a quickly hacked Perl script, and diagnosing what went wrong would be too speculative given the bug report. However, maybe I should have been clearer in the usage instructions.

So basically, what you need to do is check the input format. Your wordlist seems to be OK (from what I can see in your output sample). The corpora need to be plain text, one text file each (again, see the example input for inspiration).
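
A quick check of the input format could look like the sketch below. It is written in Python, not the Perl of the attached script, and the function names `describe_text_file` and `check_file` are hypothetical. The three checks are only assumptions about what "plain text" should mean here: no NUL bytes, decodable as UTF-8, and a note on whether Windows line endings are present (a stray carriage return at the end of a wordlist entry would stop it from matching any token).

```python
def describe_text_file(data: bytes) -> dict:
    """Report a few simple plain-text properties of raw file contents."""
    report = {
        # NUL bytes suggest a binary or UTF-16 file rather than plain 8-bit text.
        "has_nul_bytes": b"\x00" in data,
        # Windows (CRLF) line endings.
        "has_crlf": b"\r\n" in data,
        "is_utf8": True,
    }
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        report["is_utf8"] = False
    return report


def check_file(path: str) -> dict:
    """Read a file in binary mode and describe its contents."""
    with open(path, "rb") as handle:
        return describe_text_file(handle.read())
```

Calling `check_file` on the wordlist and on each corpus file gives a first hint of whether the inputs are in the expected format.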

I tested the script with a wordlist of nouns extracted from the BNC frequency list, with the Europarl corpus (en, with no tags) as corpusA and a collection of Charles Dickens novels (from Gutenberg) as corpusB.

Again, both corpora must be plain text files (I was hoping the provided example was sufficiently illustrative). Size should not be a problem: I was able to process Europarl (28m tokens) *AND* Charles Dickens ;-) in only a couple of seconds, and the script produced the desired output.
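
For readers who want to see the procedure spelled out, here is a minimal sketch in Python. It is not the attached Perl script, and the names `tokenize` and `wordlist_frequencies` are hypothetical. It assumes one wordlist entry per line and approximates the normalisation described in the thread (lower-casing and punctuation removal) with a simple regular expression, which may differ in detail from what the script does.

```python
import re
import sys
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters, digits and underscores.

    Punctuation and whitespace act as separators and are dropped.
    """
    return re.findall(r"\w+", text.lower())


def wordlist_frequencies(wordlist_text, corpus_a_text, corpus_b_text):
    """Return (word, count in corpus A, count in corpus B) for each wordlist entry."""
    counts_a = Counter(tokenize(corpus_a_text))
    counts_b = Counter(tokenize(corpus_b_text))
    rows = []
    for line in wordlist_text.splitlines():
        # Strip surrounding whitespace so stray spaces or carriage returns
        # do not prevent an entry from matching; skip empty lines.
        word = line.strip().lower()
        if word:
            rows.append((word, counts_a[word], counts_b[word]))
    return rows


if __name__ == "__main__" and len(sys.argv) == 4:
    # Arguments: wordlist file, corpus A file, corpus B file.
    texts = []
    for path in sys.argv[1:4]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            texts.append(handle.read())
    for word, freq_a, freq_b in wordlist_frequencies(*texts):
        print(word, freq_a, freq_b)
```

Saved under a name of your choice and called with the wordlist and the two corpus files as arguments, it prints one line per word in the same "word countA countB" layout as the output shown in this thread. Entries that contain punctuation or spaces will always get zero counts, because the tokenizer splits on those characters.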

Perhaps it would have been better to come up with a unix-shell pipe example, so that you can see how to do "stuff" quickly yourself, and to provide references so that you are not lost and can educate yourself when you reach the limitations of the unix shell one-liner. The pointers given so far are excellent resources for getting your hands dirty quickly, without having to learn everything about programming. More helpful resources include the Perl man pages ('man perl' or 'man perlintro') in the unix shell; hopefully your system administrator has installed them for you. I can document the script a bit more, but I suggest we handle that in private communication.

Now, I hope there won't be much need to continue this thread. Sorry, but vanity is my favourite sin ;-)

Kind regards, Alex

On Feb 18, 2008 4:56 PM, True Friend <true.friend2004 at gmail.com> wrote:


> Hi Sir
> I tried your script, but it has some problems. Probably the large
> size of the txt files was the reason. Corpus A was about 1.9 million and
> corpus B was almost the same size as A. It generated only "0"s for each
> word. Another possible cause was the big size of the wordlist (1000 words).
> A glimpse of the result:
> votes 0 0
> whereas 0 0
> whereby 0 0
> wherein 0 0
> without 0 0
> witness 0 0
> witnesses 0 0
> wound 0 0
> writ 0 0
> written 0 0
> zila 0 0
> zina 0 0
>  court 0 0
> When tried with a small wordlist it produced counts for only one word (the
> last one, *court*). Please see the result:
> judge 0 0
> judgment 0 0
> land 0 0
> law 0 0
> learned 0 0
> order 0 0
> ordinance 0 0
> person 0 0
> petition 0 0
> petitioner 0 0
> police 0 0
> record 0 0
> respondent 0 0
> section 0 0
> suit 0 0
> trial 0 0
> court 718 11128
> The procedure I had in mind was: grab the word, find its
> frequency in Corpus A and then in Corpus B, and then print it. I could not
> understand the code (not a programmer yet :D), but anyhow there is something
> wrong. So can you spare some more time for it?
> Thanks a lot for your effort to write this script.
> Regards
> M Shakir
> Pakistan
>
> On Feb 18, 2008 5:34 PM, Alexander Schutz <
> goalscoringsuperstarhero at gmail.com> wrote:
>
> > Hi Shakir,
> >
> > as part of a little exercise I wrote a tiny perl script performing what
> > you asked.
> > It takes as parameters the wordlist, the corpus_A and the corpus_B (each
> > as text files)
> > and produces as output the respective frequencies in each corpus:
> > alesch@nbgal141:~$ perl wordlist_corpus_freq.pl wordlist.txt vbush.txt How2DoResearchMIT.txt
> > color 1 0
> > colour 0 0
> > furiously 0 0
> > green 0 0
> > idea 7 22
> > sleep 0 0
> >
> > It does some normalisation on the corpora, like conversion to lower case
> > and punctuation removal.
> >
> > Please find it attached to this email, including the sample wordlist.
> >
> > Hth,
> > Alex
> >
> >
> >
> > On Feb 18, 2008 10:53 AM, True Friend <true.friend2004 at gmail.com> wrote:
> >
> > > Hi Folks
> > > I need a program/script (even for *nix) that can provide the frequencies
> > > of a wordlist in two corpora. I made this list by comparing
> > > two word lists, one from general English (specifically of Pakistani origin)
> > > and one from law English (also of Pakistani origin). I now want to present these
> > > keywords with their frequencies in both corpora as proof that these words
> > > are more frequent in law. The keywords were generated with AntConc.
> > > Is there any script/tool that can generate a parallel list of
> > > frequencies of each word in both corpora?
> > > Regards
> > > M Shakir Aziz
> > > A Corpus Linguistics Student
> > > Pakistan
> > >
> > > --
> > > محمد شاکر عزیز
> > > _______________________________________________
> > > Corpora mailing list
> > > Corpora at uib.no
> > > http://mailman.uib.no/listinfo/corpora
> > >
> > >
> >
> >
> > --
> > Alexander Schutz,
> > Digital Enterprise Research Institute,
> > Ollscoil na hÉireann, Gaillimh
> > Galway, Ireland
>
>
>
>
> --
> محمد شاکر عزیز

--
Alexander Schutz,
Digital Enterprise Research Institute,
Ollscoil na hÉireann, Gaillimh
Galway, Ireland


