[Corpora-List] Which webscraping tools do k researchers use? . . .

Johannes Kiesel johannes.kiesel at uni-weimar.de
Thu Jul 1 08:20:01 CEST 2021


Hi Albretch,

Depending on your constraints, you may want to look into web archiving tools.

You may find our web archiver useful:

https://github.com/webis-de/webis-web-archiver

It uses Selenium to render the web page and then scrolls down the page to ensure that content further "down the page" is fetched. The output contains an HTML file that corresponds to the page after these interactions, and another file that contains the text content of every DOM node. You may also find the generated web archive file useful in the future. It relies on Docker, so you should be able to get it running quickly if you are familiar with Docker (the repository contains a small wrapper script for Linux/Unix).
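To illustrate the render-then-scroll idea, here is a minimal Python sketch. It is not the archiver's actual code: the function names scroll_to_bottom and render_page, the pause length, the round limit, and the choice of headless Chrome are all assumptions made for this example, and it only returns the rendered HTML and the visible body text rather than a web archive file or per-node text.

```python
import time


def scroll_to_bottom(driver, pause_seconds=1.0, max_rounds=20):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is anything with a Selenium-style execute_script(script) method.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give lazy-loading scripts time to run
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = new_height
    return rounds


def render_page(url, pause_seconds=1.0):
    """Return (html_after_scrolling, visible_text) for the given URL."""
    # Imported here so scroll_to_bottom can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # assumption: Chrome is available
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        scroll_to_bottom(driver, pause_seconds)
        html = driver.page_source  # DOM serialised after the scripts have run
        text = driver.find_element(By.TAG_NAME, "body").text
        return html, text
    finally:
        driver.quit()
```

Calling render_page with a URL would give you the HTML as it looks after JavaScript has run, which is the part that plain HTTP fetching misses. Pages that load content only on clicks or after long delays would need more than this.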

An alternative might be the Internet Archive's Brozzler:

https://github.com/internetarchive/brozzler

Regards, Johannes

On 30.06.21 17:04, Albretch Mueller wrote:
> I care mostly about full texts, which could be about literature,
> technical, legal matters . . .
> The main problem I have found is with content generating javascript
> code in web pages.
> So, what do you use?
> lbrtchx
> corpora at uib.no: Which webscraping tools do k researchers use? . . .
> web scraping

-- Johannes Kiesel

Bauhaus-Universität Weimar, Bauhausstr. 11, Room 109, 99423 Weimar, Germany

Phone: +49 (0)3643 - 58 3720


