[Corpora-List] Inquiry on the usage of rainbow for text classification

F Su fzsu at comp.leeds.ac.uk
Sat Feb 23 18:29:31 CET 2008

Dear all,

Does anybody have experience using rainbow for text classification? It is a toolkit written by Andrew McCallum, and it is described at http://www.cs.cmu.edu/~mccallum/bow/rainbow/.

I have read the usage document. It says that in the basic setting the text data should be in plain text files, one file per document.

But it also mentions "Finding `document' boundaries when there are multiple documents per file". This makes me believe that one file can also contain more than one document. However, I haven't found the exact solution for this in the usage document.

My question is this: if a file contains more than one document (for example, many news items gathered in one file), is it possible to apply the rainbow software directly, or do I have to extract each news item and save it in a separate file? Of course I can preprocess the data this way, but each document in our dataset is very short (around 10 words) and we have more than 100,000 documents, so we would prefer to keep them in a single file.
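
To make the preprocessing option concrete, here is a minimal Python sketch of the splitting step. It assumes that the gathered file holds exactly one document per line, which may not match every dataset. The function name split_documents and the doc_NNNNNN.txt naming scheme are only illustrative choices and are not part of rainbow.

```python
from pathlib import Path


def split_documents(source_path, out_dir):
    """Write each non-empty line of source_path to its own file in out_dir.

    Assumption: the gathered file holds exactly one document per line.
    Returns the number of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(source_path, encoding="utf-8") as src:
        for line in src:
            text = line.strip()
            if not text:
                continue  # skip blank lines between documents
            count += 1
            # Zero-padded names keep the files in their original order.
            target = out / f"doc_{count:06d}.txt"
            target.write_text(text + "\n", encoding="utf-8")
    return count
```

With more than 100,000 documents this creates the same number of very small files, which is exactly the overhead we would like to avoid if rainbow can read a multi-document file directly.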

Any guidance will be highly appreciated.

Thanks, Fangzhong

