[Corpora-List] Keywords Generator

Trevor Jenkins trevor.jenkins at suneidesis.com
Mon Feb 18 18:47:48 CET 2008


On Mon, 18 Feb 2008, True Friend <true.friend2004 at gmail.com> wrote:


> Trevor Jenkins: Sorry, I forgot to mention that the size was in words:
> 1.9 million words. I also thought that the large amount of data was the
> reason.

Oh okay. So roughly 8MB to 12MB, based on an average (English) word length of, say, 6 characters. I ran my pipe of filters across the Jane Austen texts, including the juvenilia (which came to about 11MB); no problem at all, other than that all the words were stuffed into one result file. On a MacBook Pro with an Intel dual-core processor it took a matter of seconds to create the (2.5MB) result file.
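
As a rough check on that size estimate, here is a minimal sketch in Python. It assumes one byte per character (plain ASCII-ish English text) and optionally one separator character per word; the function name and its parameters are illustrative only and are not part of the filter pipeline mentioned above.

```python
def estimate_corpus_bytes(word_count, avg_word_len=6, separator_len=1):
    """Rough plain-text size in bytes, assuming one byte per character."""
    return word_count * (avg_word_len + separator_len)


words = 1_900_000

# Words only, at the assumed average of 6 characters each.
words_only = estimate_corpus_bytes(words, avg_word_len=6, separator_len=0)

# Same, plus one space (or newline) after every word.
with_separators = estimate_corpus_bytes(words)

print(f"words only:      {words_only / 1e6:.1f} MB")
print(f"with separators: {with_separators / 1e6:.1f} MB")
```

With an average of 6 characters this lands at the upper end of the range quoted above; a shorter assumed average word length pulls the figure down toward the lower end.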

Personally, I don't consider 1.9 million words to be large. I once had a junior programmer who managed to stuff an 8MB sentence into one record.

Regards, Trevor

<>< Re: deemed!


