[Corpora-List] Limiting queries to online database

William H Fletcher fletcher at usna.edu
Tue May 15 20:41:09 CEST 2012


Thanks for your suggestions, Mark and Andrew.

I have now implemented limits of 1 query per second and 1000 queries total per session per IP address. That should meet the needs of most legitimate researchers. Since many people are loath to register, I'll wait and see whether to follow your example, Mark. Requiring registration and login will certainly test users' motivation!
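For anyone wanting to try something similar, here is a minimal sketch in Python of the two limits described above: a minimum interval between queries and a cap on the total. It is not the script running on the site. The names `QueryLimiter`, `ClientState` and `allow` are hypothetical, state is kept in memory and keyed on IP address only, and session handling is left out.

```python
import time
from dataclasses import dataclass


@dataclass
class ClientState:
    """Per-client bookkeeping (hypothetical structure, for illustration)."""
    last_query: float = float("-inf")  # time of the last accepted query
    total: int = 0                     # accepted queries so far


class QueryLimiter:
    """Allow at most one query per `min_interval` seconds and at most
    `max_total` queries per client. Assumption: clients are identified
    by IP address alone; sessions are not modelled here."""

    def __init__(self, min_interval=1.0, max_total=1000, clock=time.monotonic):
        self.min_interval = min_interval
        self.max_total = max_total
        self.clock = clock      # injectable so the logic can be tested
        self.clients = {}       # ip -> ClientState

    def allow(self, ip):
        """Return True if the query may run, False if it should be refused."""
        now = self.clock()
        state = self.clients.setdefault(ip, ClientState())
        if state.total >= self.max_total:
            return False        # total cap reached
        if now - state.last_query < self.min_interval:
            return False        # too soon after the previous accepted query
        state.last_query = now
        state.total += 1
        return True
```

A refused query does not count toward the total and does not reset the interval, so a client that backs off to one query per second is served again at once. Since the bots kept switching IP addresses, a per-IP table like this only slows them down; it does not stop them.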

If any users on this list require a higher query limit I'll be glad to grant it.

BTW, the bots kept switching IP addresses; until I got my scripts tweaked they ran over 12.5 M queries in the last couple of days, for over 21 M in the last week!

Regards, Bill

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Hardie, Andrew
Sent: Monday, May 14, 2012 2:06 PM
To: corpora at uib.no
Subject: Re: [Corpora-List] Limiting queries to online database

Hi Bill,

My instinct would be to block the IP of any such nuisance clients completely.

The BNC is a very widely available dataset and if someone really does need to run such a vast number of queries for a legitimate purpose, they should be running them on a local copy and not leeching your bandwidth. Clearly they have the tech savvy to do so if they are capable of programming this robot. At the very least they should have had the courtesy either to throttle the pace of queries right back, or to contact you to ask permission.

And that's giving them the benefit of the doubt and assuming they are legitimate users to begin with; more likely than not it is some kind of spam probe or, as Mark suggested, someone trying to avoid paying for the BNC.

Although it won't deter malicious parties, you should consider using robots.txt and a Crawl-delay directive so that genuine good-faith users know what is an acceptable query rate.
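As a rough illustration of the client side of this, the sketch below shows how a good-faith robot might read a Crawl-delay value before querying, using Python's standard `urllib.robotparser`. The robots.txt content and the function name `polite_delay` are made up for the example, and Crawl-delay is a non-standard directive that not every crawler honours.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every robot to wait one second per request.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 1
"""


def polite_delay(robots_txt, user_agent="*", default=1.0):
    """Return the number of seconds a client should wait between requests.

    Falls back to `default` when robots.txt gives no Crawl-delay for this
    user agent. Note that the standard-library parser only accepts whole
    numbers of seconds for Crawl-delay.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

A client would then sleep for `polite_delay(...)` seconds between queries. Nothing here enforces the delay on the server; it only tells cooperative users what pace is acceptable.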

All of this, by the way, is one reason why with my own software, I recommend people set up CQPweb servers with password-controlled access!

best

Andrew.

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of William H Fletcher
Sent: 14 May 2012 18:27
To: corpora at uib.no
Subject: [Corpora-List] Limiting queries to online database

Hello,

My site http://phrasesinenglish.org/ provides query interfaces to databases derived from the BNC. Over the past week performance has suffered from extraordinarily high query traffic from a handful of IP addresses. I am seeking advice on what a realistic limit on queries from one user would be and how to limit traffic from a single IP efficiently.

In the past my policy was to place no limits on number of queries or size of datasets returned, on the assumption that this generous approach facilitates research. Occasionally I have hit a server with thousands of queries, but at a maximum pace of 1-2 per second. Most users on the site submit at most a few dozen queries per day. On rare occasions I have seen short bursts of, say, 20,000-60,000 queries from a single IP address.

Last week I found over 8 million queries for co-occurrences of apparently random pairs of word forms (e.g. 'entertainer mussel') coming from several IP addresses in Beijing. Now, over the last day or so there have been almost 6 million queries from one IP address in Seoul (110-120 per second). It's a valuable stress-test for my server, but I fear the degradation of response times will drive away regular users.

1. What constitutes a reasonable number of queries per day to tolerate from a single robot user, after which access would be denied or limited?

2. How can I implement such access restrictions? I am using the Nginx server, MySQL / Sphinx and PHP on a Debian Linux platform. I know how to block an IP address completely, but have no good strategy for simply limiting such traffic.

Many thanks in advance for any feedback you can give.

Regards, Bill Fletcher

_______________________________________________ UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora Corpora mailing list Corpora at uib.no http://mailman.uib.no/listinfo/corpora



