[Corpora-List] Getting articles from newspapers to compile a corpus

True Friend true.friend2004 at gmail.com
Fri Nov 30 04:52:21 CET 2012


I have a related question: news websites these days use AJAX, which keeps links out of the initial page source and generates them via JavaScript instead. See this page <http://www.nation.com.pk/pakistan-news-newspaper-daily-english-online/opinions/editorials> for example. Apparently this is the archive page for all editorials on the newspaper's website, but only a few are shown, and the user has to click on "Show more news" under the listed stories to load a few more of the earlier editorials. Would an HTML crawler be able to get around this and retrieve all the links hidden on this page? Regards
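To make the problem concrete: a plain HTML crawler only sees the links present in the HTML it downloads, so anything added later by JavaScript is invisible to it unless the crawler requests the same URL that the "Show more news" button requests. Below is a minimal Python sketch of that idea, under loud assumptions: I do not know what request this site actually makes, so `page_url_template` is a hypothetical paginated address (the real one would have to be read from the browser's network inspector), and the names `LinkCollector`, `fetch_text` and `collect_links` are my own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags in an HTML document or fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_text(url):
    """Download a URL and return its body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect_links(base_url, page_url_template, max_pages=50, fetch=fetch_text):
    """Request page 1, 2, 3, ... and stop when a page adds no new links.

    page_url_template is a HYPOTHETICAL address containing "{page}";
    it stands in for whatever request the "Show more" button sends.
    """
    seen = []
    for page in range(1, max_pages + 1):
        parser = LinkCollector()
        parser.feed(fetch(page_url_template.format(page=page)))
        added = False
        for href in parser.links:
            url = urljoin(base_url, href)  # resolve relative links
            if url not in seen:
                seen.append(url)
                added = True
        if not added:  # nothing new on this page: assume we reached the end
            break
    return seen
```

If the button's request cannot be reproduced this way (for instance, if it needs tokens computed in the page), the other common route is to drive a real browser and click the button programmatically, which this sketch does not cover.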

On Fri, Nov 30, 2012 at 8:35 AM, Angus Grieve-Smith <grvsmth at panix.com> wrote:


> On 11/29/2012 4:28 PM, Linda Bawcom wrote:
>
> Because so many newspapers get their information from the same news
> services, I found a few articles that I had to discard because of an over
> 80% similarity ratio and of course that skews statistics.
>
>
> Good point! Some newspapers will abridge the wire stories more than
> others, so it might be useful to find a way to choose the longest version.
>
> --
> -Angus B. Grieve-Smith
> grvsmth at panix.com

-- 
*Muhammad Shakir Aziz* *محمد شاکر عزیز*
*Master in Applied Linguistics Translator, Course Developer, Linguist for Urdu, Punjabi and English*
Urdu:- http://awaz-e-dost.blogspot.com/
English:- http://linguisticslearner.blogspot.com/
Facebook:- http://www.facebook.com/truefriend2004
Skype:- true_friend2004


