[Corpora-List] Getting articles from newspapers to compile a corpus

William Fletcher fletcher at usna.edu
Sun Dec 2 20:24:07 CET 2012


There are many free tools out there to scrape websites for specific content. This tutorial includes an example that is somewhat comparable: http://net.tutsplus.com/tutorials/javascript-ajax/web-scraping-with-node-js/

You might also take a look at Bobik: http://usebobik.com/ Bobik is a cloud-powered service for scraping websites in real time. You can use any language you want as Bobik's own API is entirely HTTP-based.
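
If you would rather write something yourself, here is a minimal Python sketch of one common approach to "Show more news" pages like the one asked about below. It assumes the button simply requests further batches of HTML from a paged URL, which you would first have to find by watching the network traffic in your browser's developer tools. The endpoint pattern, the `{page}` parameter and the names `LinkCollector` and `collect_links` are all hypothetical; I have not checked how that particular site actually loads its archive.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML fragment."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the site's base URL.
                self.links.append(urljoin(self.base_url, href))


def fetch(url):
    """Download a URL and return its body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_links(endpoint_template, base_url, fetch=fetch, max_pages=50):
    """Request page 1, 2, 3, ... and stop when a page adds no new links.

    endpoint_template is a hypothetical URL pattern containing "{page}".
    """
    seen = []
    for page in range(1, max_pages + 1):
        parser = LinkCollector(base_url)
        parser.feed(fetch(endpoint_template.format(page=page)))
        added = 0
        for link in parser.links:
            if link not in seen:
                seen.append(link)
                added += 1
        if added == 0:
            break
    return seen
```

You would call it as `collect_links("http://example.com/more?page={page}", "http://example.com/")`, substituting whatever URL pattern the site really uses. If the site builds its requests in a more obscure way (POST data, tokens, JSON responses), this simple loop will not be enough.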

Regards, Bill Fletcher

On Sat, Dec 1, 2012 at 2:17 PM, Angus B. Grieve-Smith <grvsmth at panix.com> wrote:


> On 11/29/2012 10:52 PM, True Friend wrote:
>
> I have a related question: news websites these days use AJAX, which
> hides links while generating them via JavaScript. See this
> page <http://www.nation.com.pk/pakistan-news-newspaper-daily-english-online/opinions/editorials> for example.
> Apparently this is the archive page for all editorials on the newspaper
> website, but only a few are shown, and the user has to click on "Show more
> news" under the listed stories to get a few more earlier editorials. Would
> an HTML crawler be able to bypass this and get all the links hidden on this
> page?
>
>
> It is possible. Certainly, anyone with enough programming skill could
> write an HTML crawler that can give an AJAX website the information it's
> looking for. In practice, the site's scripts may be so obfuscated that
> it's not worth the time and effort.
>
> --
> Angus B. Grieve-Smith
> grvsmth at panix.com


