[Corpora-List] tool for extracting text from web forum and websites

Stefan Th. Gries stgries at gmail.com
Fri Oct 16 00:39:04 CEST 2009


You can use R (<http://www.r-project.org/>) to download files and clean them easily. To load the contents of <http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html> with the HTML tags stripped, you just enter this at the console:

(x <- gsub("<[^>]*?>", "", scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html", what=character(0), sep="\n", quote="", comment.char=""), perl=T))

or this (to print it into a file called <res.txt>):

x <- gsub("<[^>]*?>", "", scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html", what=character(0), sep="\n", quote="", comment.char=""), perl=T)
cat(x, file="res.txt", sep="\n")
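
If you want to do this for more than one page, the same two steps (read the page line by line, then delete everything that looks like a tag) can be wrapped in small functions. This is only a minimal sketch: the names strip_tags and fetch_text are made up for illustration, and the HTML snippet at the end is invented so that the tag-stripping step can be tried without a network connection.

```r
# Remove anything that looks like an HTML tag from each element of a
# character vector; same regular expression as in the commands above.
strip_tags <- function(lines) {
  gsub("<[^>]*?>", "", lines, perl = TRUE)
}

# Hypothetical helper: read a page line by line and strip its tags.
fetch_text <- function(url) {
  lines <- scan(url, what = character(0), sep = "\n",
                quote = "", comment.char = "", quiet = TRUE)
  strip_tags(lines)
}

# Offline check on a small, invented snippet of HTML.
strip_tags(c("<p>Hello <b>world</b></p>", "<br>", "plain line"))
```

Note that because the page is read one line at a time, a tag that is split across two lines will not be removed, and character entities such as &amp; are left as they are; forum pages may need a few additional gsub() calls for such cases.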

Cf. <http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/other_5.pdf> for a more detailed application.

HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------


