[Corpora-List] Getting articles from newspapers to compile a corpus

Angus B. Grieve-Smith grvsmth at panix.com
Sat Dec 1 20:17:03 CET 2012


On 11/29/2012 10:52 PM, True Friend wrote:
> I have a related question: news websites these days use AJAX, which
> hides links and generates them via JavaScript instead.
> See this page
> <http://www.nation.com.pk/pakistan-news-newspaper-daily-english-online/opinions/editorials>
> for example. Apparently this is the archive page for all editorials on
> the newspaper website, but only a few are shown, and the user has to click
> on "Show more news" under the listed stories to get a few more earlier
> editorials. Would an HTML crawler be able to bypass this and get all
> the links hidden on this page?
>

It is possible. A "Show more news" button only makes the browser send another HTTP request, so anyone with enough programming skill could write a crawler that sends the site the same requests its JavaScript would send. In practice, the JavaScript may be so obfuscated that working out those requests is not worth the time and effort.
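
To make the idea concrete, here is a minimal sketch in Python of a crawler that replays such requests itself. Everything site-specific in it is an assumption: I have not inspected the page in question, so the endpoint URL, the "page" query parameter, and the idea that the server answers with an HTML fragment are all hypothetical. You would have to find the real request by watching the network traffic in your browser's developer tools while clicking the button.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_links(endpoint, base_url, fetch_page=fetch, max_pages=50):
    """Request successive pages until one yields no new links.

    endpoint is the (hypothetical) URL the "show more" button calls;
    fetch_page can be swapped out, e.g. for testing without a network.
    """
    seen = []
    for page in range(1, max_pages + 1):
        # "page" is an assumed parameter name, not the site's real one.
        html = fetch_page(endpoint + "?" + urlencode({"page": page}))
        parser = LinkCollector()
        parser.feed(html)
        found_new = False
        for href in parser.links:
            url = urljoin(base_url, href)  # resolve relative links
            if url not in seen:
                seen.append(url)
                found_new = True
        if not found_new:
            break  # nothing new came back, so assume the archive is exhausted
    return seen
```

If the site returns JSON instead of HTML, or wants a POST with a token, the fetching and parsing steps change but the loop stays the same. When the request cannot be worked out at all, driving a real browser from a script is the usual fallback.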

-- Angus B. Grieve-Smith grvsmth at panix.com



