[Corpora-List] Handling a Large Text Archive

Laurence Anthony anthony0122 at gmail.com
Wed Jan 4 16:48:39 CET 2012


On Wed, Jan 4, 2012 at 11:57 PM, True Friend <true.friend2004 at gmail.com> wrote:


> Hi
> I've a large text archive of 100+ million words in utf8 encoding
> (non-English text archive). Sometimes I need to get a concordance or a word
> list, but its size causes problems. I've tried AntConc (always hangs when I
> open the text files in it), as well as TextSTAT (works fine for concordance
> usually but hangs when a word list task is given). Any good free
> alternative to handle big text archives? Or any efficient way to handle
> such a large collection?
> Thanks a lot for taking time and reading this email. Your response will be
> highly appreciated.
> Regards
>
>
Hi,

AntConc is really designed for corpora of just a few million words. It also assumes that each corpus file is quite small. That's why you will find it hangs on a corpus of 100+ million words. That said, I'm now working on a new version that will (hopefully) handle corpora of 100+ million words smoothly. I'll announce it here when it's ready.
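
In the meantime, one general way around the file-size problem is to stream the text line by line instead of loading it all at once. The following is only a minimal sketch in Python, not how AntConc or TextSTAT work internally. The function names and the `\w+` tokenizer are assumptions for illustration; `\w+` will split words at combining marks in some scripts, so a real non-English corpus would likely need a language-specific tokenizer.

```python
import re
from collections import Counter

# Unicode-aware token pattern (an assumption; real tokenization is language-specific).
TOKEN = re.compile(r"\w+")


def word_list(lines):
    """Count word frequencies from an iterable of lines, one line at a time."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN.findall(line.lower()))
    return counts


def concordance(lines, keyword, width=30):
    """Yield (left context, hit, right context) for each occurrence of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for line in lines:
        for m in pattern.finditer(line):
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width].rstrip("\n")
            yield left, m.group(0), right


def word_list_from_file(path):
    """Build a word list from a UTF-8 file without reading it into memory at once."""
    with open(path, encoding="utf-8") as f:
        return word_list(f)
```

Because a file object yields one line at a time, memory use depends on the number of distinct word types rather than the total corpus size. A hypothetical call would be `word_list_from_file("archive.txt").most_common(50)`. Note that the concordance context here is limited to the line containing the hit.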

Laurence Anthony


