[Corpora-List] Which webscraping tools do k researchers use? . . .

Albretch Mueller lbrtchx at gmail.com
Thu Jul 1 19:22:07 CEST 2021


On 7/1/21, Johannes Kiesel <johannes.kiesel at uni-weimar.de> wrote:
> Depending on your constraint, you may want to look into web archiving
> tools.
> You may find our web archiver useful:
> https://github.com/webis-de/webis-web-archiver

...
> An alternative might be the Internet Archive's Brozzler:
> https://github.com/internetarchive/brozzler

~

https://github.com/internetarchive/brozzler

"browser" | "crawler" = "brozzler"

Brozzler is a distributed web crawler that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. It employs youtube-dl to enhance media capture capabilities and rethinkdb to manage crawl state.

Brozzler is designed to work in conjunction with warcprox for web archiving.

Requirements:

* Python 3.5 or later

* RethinkDB deployment

* Chromium or Google Chrome >= version 64

Note: The browser requires a graphical environment to run. When brozzler is run on a server, this may require deploying some additional infrastructure, typically X11. Xvnc4 and Xvfb are X11 variants that are suitable for use on a server, because they don't display anything to a physical screen. The vagrant configuration in the brozzler repository has an example setup using Xvnc4. (When last tested, chromium on Xvfb did not support screenshots, so Xvnc4 is preferred at this time.)

~

https://github.com/webis-de/webis-web-archiver

webis-web-archiver

Source code and scripts for the Webis Web Archiver.

If you use the archiver, please cite the paper that describes it in detail:

https://webis.de/downloads/publications/papers/kiesel_2018c.pdf

Quickstart

You need to have Docker installed:

https://www.docker.com/pricing

Then, on a Unix machine:

run src-bash/archive.sh for archiving web pages. It will display usage hints.

run src-bash/reproduce.sh for reproducing from an archive. It will display usage hints.

The scripts will automatically download and run the image (2GB+ due to all the fonts).

For other OSes, have a look at the shell scripts and adjust the call to docker run accordingly.

~

Unfortunately I don't see the options you have presented to me as a solution to the kind of problem that I have in mind.

I want to run a Squid caching proxy server with a dynamically adjustable ICAP extension (which I will most probably hook up to some adjustable Java code that parses and deals with all that JS, web bugs, ... goo) on a low-end Raspberry Pi board. All your options include a browser, which may be even more demanding than a Java installation with Nashorn, or maybe even a GraalVM installation from which you have removed all the packages you don't need (which may not be quite legal, right? But I would like to test such a thing anyway).
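
To make the kind of filtering meant here concrete, below is a minimal sketch of a content filter that drops script elements and 1x1 "web bug" images from an HTML response body. It is written in Python with only the standard library for brevity, although the plan above is to use Java. The names ScriptAndBugStripper and strip_scripts_and_bugs are hypothetical, the 1x1/0x0 size test is only an assumed heuristic for web bugs, and the wiring into Squid through ICAP is not shown.

```python
from html.parser import HTMLParser


class ScriptAndBugStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and tiny <img> web bugs."""

    def __init__(self):
        # Keep character references untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []        # pieces of the filtered document
        self.removed = []    # (tag, src) pairs that were dropped
        self._in_script = False

    @staticmethod
    def _is_web_bug(attrs):
        # Assumed heuristic: an image declared as 0 or 1 pixel in both dimensions.
        a = dict(attrs)
        return a.get("width") in ("0", "1") and a.get("height") in ("0", "1")

    def _emit(self, text):
        # Anything seen while inside a <script> element is discarded.
        if not self._in_script:
            self.out.append(text)

    def _drop(self, tag, attrs):
        self.removed.append((tag, dict(attrs).get("src")))

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._drop(tag, attrs)
        elif tag == "img" and self._is_web_bug(attrs):
            self._drop(tag, attrs)
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> have no separate end tag.
        if tag == "script" or (tag == "img" and self._is_web_bug(attrs)):
            self._drop(tag, attrs)
        else:
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        else:
            self._emit("</%s>" % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)


def strip_scripts_and_bugs(html):
    """Return (filtered_html, removed), where removed lists the dropped tags."""
    parser = ScriptAndBugStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.removed
```

A filter like this only removes script code; it does not execute it, so any content that a page builds with JavaScript would still be missing from the result.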

Another option which I think I have is running everything on the same computer and running a browser (most probably based on JavaFX's WebView) in transparent proxy mode, but why run two browser engines?

Is there a way you know of to handle JavaScript code without using a full-blown browser engine?

lbrtchx


