[Corpora-List] Keywords Generator

Trevor Jenkins trevor.jenkins at suneidesis.com
Mon Feb 18 18:23:19 CET 2008


On Mon, 18 Feb 2008, True Friend <true.friend2004 at gmail.com> wrote:


> Hi Sir
> Tried your script but ........ it has some problems. Probably the large
> size of txt files was the reason. Corpus A was about 1.9 million and
> corpus B was almost as A.

I'll leave Alex to comment on the use of his script, but I wonder what you are reporting here with these numbers. Do you mean 1.9 million documents, words, or characters?

The texts I used for my pipe-line script are all about 1.9 MB (1.9 million characters) in size. The individual filters I used have no problem processing that amount of data; I've processed larger inputs with the same pipe-line.

It might be that Alex's quick script can't cope with the volume of data you are throwing at it. If so, you'll either have to use something else or improve the script to cope with large volumes.
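As a rough illustration of what "coping with large volumes" could look like, here is a minimal Python sketch that reads a text one line at a time instead of loading it whole, and reports the size in both characters and words so the units are unambiguous. It is not Alex's script and not my pipe-line; the names corpus_stats and stats_for_file, the word pattern, and the UTF-8 default are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters; adjust the pattern for your corpus.
WORD_RE = re.compile(r"[^\W\d_]+")


def corpus_stats(lines):
    """Stream over an iterable of text lines.

    Returns (character count, word count, word-frequency Counter).
    Only one line is held in memory at a time, plus the frequency table.
    """
    chars = 0
    words = 0
    freq = Counter()
    for line in lines:
        chars += len(line)
        tokens = WORD_RE.findall(line.lower())
        words += len(tokens)
        freq.update(tokens)
    return chars, words, freq


def stats_for_file(path, encoding="utf-8"):
    """Hypothetical helper: run corpus_stats over a file without reading it whole."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return corpus_stats(handle)
```

Running stats_for_file on corpus A and on corpus B would give two frequency tables that a keyword comparison could then work from, and the character and word totals would settle which "1.9 million" is meant.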

Regards, Trevor

<>< Re: deemed!


