[Corpora-List] Which webscraping tools do k researchers use? . . .

David Chartash dchartas at ieee.org
Thu Jul 1 14:51:33 CEST 2021


Hi Albretch,

If you're running into JavaScript problems, I'd also recommend a package such as `dukpy` <https://pypi.org/project/dukpy/>. I've successfully integrated it with `selenium-requests` <https://pypi.org/project/selenium-requests/> to work through JavaScript rendering issues on individual web pages...

Cheers,

David
--
David Chartash
BESc (Western Ontario), MHSc (Toronto), PhD (Indiana)
School of Medicine, University College Dublin - National University of Ireland, Dublin
Scoil an Leighis, An Coláiste Ollscoile Baile Átha Cliath - Ollscoil na hÉireann, Baile Átha Cliath
UCD Health Sciences Centre, Belfield, County Dublin, Republic of Ireland, Dublin 4
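
A minimal sketch of one way `dukpy` can help with script-generated text, not the exact integration mentioned above: collect the inline scripts from a fetched page with the standard library, then evaluate one of them with `dukpy.evaljs`. The names `InlineScriptCollector`, `inline_scripts`, `evaluate_js` and `SAMPLE_HTML` are hypothetical and exist only for illustration. Note that dukpy embeds a bare JavaScript interpreter with no DOM, so this only works for scripts that compute values without touching `document` or `window`.

```python
from html.parser import HTMLParser


class InlineScriptCollector(HTMLParser):
    """Collect the source text of inline <script> elements (no src attribute)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        # Scripts loaded via src have no inline body worth collecting.
        if tag == "script" and "src" not in dict(attrs):
            self._in_inline_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so append to the last one.
        if self._in_inline_script:
            self.scripts[-1] += data


def inline_scripts(html):
    """Return the non-empty inline script bodies found in an HTML string."""
    parser = InlineScriptCollector()
    parser.feed(html)
    parser.close()
    return [s.strip() for s in parser.scripts if s.strip()]


def evaluate_js(source):
    """Evaluate JavaScript with dukpy and return the value of the last expression."""
    import dukpy  # third-party: pip install dukpy

    return dukpy.evaljs(source)


# Hypothetical page fragment standing in for a fetched document.
SAMPLE_HTML = """<html><body>
<script src="lib.js"></script>
<script>var parts = ["full", "text"]; parts.join(" ");</script>
</body></html>"""

if __name__ == "__main__":
    for script in inline_scripts(SAMPLE_HTML):
        print(evaluate_js(script))
```

For pages whose scripts build the text by manipulating the DOM, a real browser driven through Selenium is still needed; an interpreter-only approach is cheaper when the content can be computed from the script alone.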


On Thu, Jul 1, 2021 at 2:20 AM Johannes Kiesel <johannes.kiesel at uni-weimar.de> wrote:


> Hi Albretch,
>
> Depending on your constraints, you may want to look into web archiving
> tools.
>
>
> You may find our web archiver useful:
>
> https://github.com/webis-de/webis-web-archiver
>
> It uses Selenium to render the web page and then scrolls down the
> web page to ensure other content "down the page" is fetched. The output
> it generates contains an HTML file that corresponds to the page after
> the interactions, and another file that contains the text content of
> every DOM node. You may also find the generated web archive file useful
> in the future. It relies on Docker, so you should be able to get it
> to run quickly if you are familiar with it (the repository
> contains a small wrapper script for Linux/Unix).
>
>
> An alternative might be the Internet Archive's Brozzler:
>
> https://github.com/internetarchive/brozzler
>
>
> Regards,
> Johannes
>
> On 30.06.21 17:04, Albretch Mueller wrote:
> > I care mostly about full texts, which could be about literature,
> > technical, legal matters . . .
> > The main problem I have found is with JavaScript code that generates
> > content in web pages.
> > So, what do you use?
> > lbrtchx
> > corpora at uib.no: Which webscraping tools do k researchers use? . . .
> > web scraping
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > https://mailman.uib.no/listinfo/corpora
> >
>
> --
> Johannes Kiesel
>
> Bauhaus-Universität Weimar
> Bauhausstr. 11, Room 109
> 99423 Weimar, Germany
>
> Phone: +49 (0)3643 - 58 3720
>
>
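
The render-then-scroll approach described in the quoted message can be sketched roughly as follows. This is an illustration under stated assumptions, not the webis-web-archiver's actual code: the function name `scroll_to_bottom` and its parameters `pause_seconds` and `max_rounds` are hypothetical. It relies only on the `execute_script` method and `page_source` attribute that Selenium WebDriver objects provide.

```python
import time


def scroll_to_bottom(driver, pause_seconds=1.0, max_rounds=20):
    """Scroll until the page height stops growing, then return the rendered HTML.

    `driver` is expected to be a Selenium WebDriver that has already loaded a page.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the current bottom so lazily loaded content gets requested.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give scripts time to append new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended, so stop scrolling
        last_height = new_height
    return driver.page_source


if __name__ == "__main__":
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()
    try:
        driver.get("https://example.org/")  # placeholder URL
        html = scroll_to_bottom(driver)
        print(len(html))
    finally:
        driver.quit()
```

The `max_rounds` cap is a guard against pages that keep loading content indefinitely; a fixed pause is the simplest waiting strategy and may need tuning per site.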