You might also take a look at Bobik: http://usebobik.com/
Bobik is a cloud-powered service for scraping websites in real time. Because its API is entirely HTTP-based, you can use it from any language you want.
Regards, Bill Fletcher
On Sat, Dec 1, 2012 at 2:17 PM, Angus B. Grieve-Smith <grvsmth at panix.com> wrote:
> On 11/29/2012 10:52 PM, True Friend wrote:
> I have a related question: News websites (these days) are using AJAX,
> page <http://www.nation.com.pk/pakistan-news-newspaper-daily-english-online/opinions/editorials> for example.
> Apparently this is the archive page for all Editorials on the newspaper
> website, but only a few are shown, and the user has to click on "Show more
> news" under the given stories to get a few more previous editorials. Would
> an html crawler be able to bypass this and get all links hidden on this
> It is possible. Certainly, anyone with enough programming skill could
> write an HTML crawler that can give an AJAX website the information it's
> looking for. In practice, it may be so obfuscated that it's not worth the
> time and effort.
> Angus B. Grieve-Smith
> grvsmth at panix.com
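
To make Angus's point concrete, here is a minimal Python sketch of such a crawler. It rests on assumptions not confirmed anywhere in this thread: that the "Show more news" button requests an endpoint which accepts a `page` query parameter and returns an HTML fragment containing ordinary links. The endpoint and parameter name are hypothetical; on a real site you would have to find the actual request in the browser's network inspector, and it may well look different.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the fragment."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch_fragment(endpoint, page):
    # HYPOTHETICAL request shape: endpoint?page=N returning an HTML fragment.
    # The real site may use POST, JSON, or different parameter names.
    url = endpoint + "?" + urlencode({"page": page})
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect_all_links(endpoint, fetch=fetch_fragment, max_pages=500):
    """Request successive pages until one contributes no new links."""
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        found_new = False
        for url in extract_links(fetch(endpoint, page), endpoint):
            if url not in seen:
                seen.add(url)
                ordered.append(url)
                found_new = True
        if not found_new:
            break  # nothing new: assume the archive is exhausted
    return ordered
```

The `fetch` argument is there so the paging loop can be tried against canned HTML before pointing it at a live site. If the site builds its requests in obfuscated JavaScript, finding the right endpoint may be the hard part, which is the caveat Angus raises.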