[Corpora-List] Web Content Extractor / Screen Scraper

Eric Atwell eric at comp.leeds.ac.uk
Mon Jun 18 23:56:01 CEST 2007


Resty,

take a look at the CORPORA archive for web-as-corpus tools:

http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0705&L=CORPORA&P=R1226&I=-3

"... You can use a web-as-corpus collection tool such as WWW-Bootcat,
a web interface to Baroni's Perl BootCat:
http://corpora.fi.muni.cz/bootcat/

or WeBoCa, a Java alternative by Leeds student Michael Drayson, an
extension of Andy Roberts' JBootCat: http://code.google.com/p/weboca/
..."
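
For a rough sense of what "text-only" extraction involves, here is a
minimal sketch in Python using only the standard library's html.parser
module. It is not how BootCat or WeBoCa work internally. The names
TextExtractor and html_to_text, and the list of block-level tags, are
assumptions made for this illustration. It handles static HTML only and
will not see text that a page generates with JavaScript.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style>."""

    # Tags whose contents are never shown to the reader.
    SKIP = {"script", "style"}
    # Illustrative (not exhaustive) set of tags that imply a text break.
    BLOCK = {"p", "div", "br", "li", "tr", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    page = ("<html><head><title>News</title><style>p{}</style></head>"
            "<body><p>Hello <b>world</b>!</p>"
            "<script>var x = 1;</script></body></html>")
    print(html_to_text(page))
```

A real corpus-building run would also need to fetch the pages, detect
their character encoding, and strip navigation and advertising text,
which is the work the tools above take care of.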

Eric Atwell, Leeds University


On Tue, 19 Jun 2007, Resty Cena wrote:


> Hello,
> I am looking for a free or open-source Windows utility/application that
> extracts the text-only rendered (not raw) contents of web pages, such as one
> would use for automatically scraping news feeds. Does anyone use such an
> application?
>
> Basically the application will be used to harvest texts on the internet to
> build a corpus.
>
> All the best,
> Resty
>






