[Corpora-List] Getting articles from newspapers to compile a corpus

cagri coltekin c.coltekin at rug.nl
Fri Nov 30 01:21:48 CET 2012


On Thu, Nov 29, 2012 at 10:54:46PM +0100, Matías Guzmán wrote:
>
> I already tried wget, it seems to work quite well, but I wasn't able to
> clean the html files it creates using BeautifulSoup for Python. Maybe
> somebody knows of other software capable of doing this?

For cleaning HTML files, JusText (http://code.google.com/p/justext/) might be what you are looking for. If you also want to remove duplicate or near-duplicate documents, you will need another tool, or you will have to write your own.
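
If you go the write-your-own route for near-duplicates, something along these lines could be a starting point. This is only a minimal sketch, assuming the text has already been extracted from the HTML: it compares documents by the Jaccard similarity of their word 5-gram "shingles". The function names, the shingle size and the 0.8 threshold are my own illustrative choices, not taken from any particular tool, and would need tuning on real newspaper data.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) found in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Very short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, n=5):
    """Keep each document unless it is too similar to one already kept.

    Every document is compared against all kept ones, so the cost is
    quadratic in the number of documents (an assumption that only suits
    small corpora).
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

You would call drop_near_duplicates() on the list of cleaned article texts. For a large crawl the pairwise comparison gets too slow, and a hashing-based scheme would be the usual next step.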

Cagri


