[Corpora-List] Which webscraping tools do k researchers use? . . .

Adrien Barbaresi barbaresi at bbaw.de
Fri Jul 2 18:50:57 CEST 2021


Dear Albretch,

I am working on a web crawling, scraping and corpus construction tool which can be run with Python, R or on the command-line.

It is currently used daily in production, notably to build monitor corpora for the ZDL/DWDS (where I work), the Internet Archive's sandcrawler project, or SciencesPo's médialab.

Documentation: https://trafilatura.readthedocs.io/
Software: https://github.com/adbar/trafilatura

The software combines crawling, downloading, extraction and format conversion functions. The latter two can be used together with the crawling, rendering and archiving tools mentioned in this thread: HTML files (with or without JavaScript rendered) serve as input, from which the main article text, comments and metadata are extracted. The resulting information can be exported as TXT, CSV, JSON or XML.
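As a rough illustration of the extraction and conversion step, here is a minimal sketch in Python. It assumes trafilatura is installed (pip install trafilatura); the sample page and the helper name extract_record are made up for this example, and very short pages may yield no result at all.

```python
import json

try:
    import trafilatura  # third-party package: pip install trafilatura
except ImportError:  # lets the sketch load even when the package is missing
    trafilatura = None

# Hypothetical input: in practice this would be an HTML file saved by a
# crawler, a headless browser or an archiving tool.
SAMPLE_HTML = """
<html><head><title>Example article</title></head>
<body><article>
<h1>Example article</h1>
<p>This paragraph stands in for the main text of a news article or blog
post. Real pages are much longer, and the extractor may discard documents
that it considers too short to contain meaningful content.</p>
</article></body></html>
"""


def extract_record(html):
    """Return main text and metadata as a dict, or None if nothing was found."""
    if trafilatura is None:
        return None
    # output_format can also be set to "txt", "csv" or "xml"
    raw = trafilatura.extract(html, output_format="json", include_comments=True)
    return json.loads(raw) if raw else None


record = extract_record(SAMPLE_HTML)
print(record)
```

The same function can be applied to pages downloaded by any other tool, since it only needs the HTML as a string.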

Concerning JavaScript and interaction with webpages, you could have a look at puppeteer or its Python port pyppeteer:
https://github.com/puppeteer/puppeteer/
https://github.com/pyppeteer/pyppeteer
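For the JavaScript case, a small sketch of how pyppeteer might be used to obtain the rendered HTML of a page. The function name render_html, the timeout value and the wait condition are choices made for this example, not recommendations from the pyppeteer project, and the first run downloads a Chromium build.

```python
import asyncio


async def render_html(url, timeout_ms=30000):
    """Load a page in headless Chromium and return its HTML after rendering."""
    # Imported here so that this module loads even without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        # Wait until network activity has mostly stopped (assumed to be
        # good enough for typical pages; adjust as needed).
        await page.goto(url, {"waitUntil": "networkidle2", "timeout": timeout_ms})
        return await page.content()
    finally:
        await browser.close()


# Example call (needs network access and a Chromium download):
# html = asyncio.run(render_html("https://example.org"))
```

The returned string can then be passed to an extraction tool in the same way as any other HTML file.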

Finally, here are complete examples of interaction with web archives: https://github.com/GLAM-Workbench/web-archives

I hope this helps! Best, Adrien


