[Corpora-List] tool for extracting text from web forum and websites

Timothy Baldwin tb at ldwin.net
Thu Oct 15 05:35:05 CEST 2009


Hi Isabella,


> I need a tool for extracting all the text from pages and subpages of a Web
> Forum. I do not need a cleaning tool at the moment.
>
> Can you suggest a tool to perform this operation?

We developed SiteScraper (http://sitescraper.googlecode.com) at Melbourne University for exactly this purpose -- scraping threads from web user forums, maintaining as much structure as possible (e.g. posts, titles, thread titles, timestamps, post authors). You will need to provide a couple of training instances (literally a handful), but otherwise, it should just work. Email me off list if you are after more details.

Tim


