My site http://phrasesinenglish.org/ provides query interfaces to databases derived from the BNC. Over the past week, performance has suffered from extraordinarily high query traffic from a handful of IP addresses. I am seeking advice on what a realistic limit on queries from one user would be, and on how to limit traffic from a single IP efficiently.
In the past my policy was to place no limits on the number of queries or the size of datasets returned, on the assumption that this generous approach facilitates research. Occasionally I have hit a server with thousands of queries myself, but at a maximum pace of 1-2 per second. Most users on the site submit at most a few dozen queries per day. On rare occasions I have seen short bursts of, say, 20,000-60,000 queries from a single IP address.
Last week I found over 8 million queries for co-occurrences of apparently random pairs of word forms (e.g. 'entertainer mussel') coming from several IP addresses in Beijing. Now, over the last day or so, there have been almost 6 million queries from one IP address in Seoul (110-120 per second). It's a valuable stress test for my server, but I fear the degradation of response times will drive away regular users.
1. What constitutes a reasonable number of queries per day to tolerate from a single robot user, after which access would be denied or limited?
2. How can I implement such access restrictions? I am using the Nginx server, MySQL / Sphinx and PHP on a Debian Linux platform. I know how to block an IP address completely, but have no good strategy for simply limiting such traffic.
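To make the second question concrete: one common way to throttle rather than block is a token bucket per IP address, which allows a short burst and then holds each client to a sustained rate. The following is only a minimal sketch of that idea, written in Python purely for illustration (the site itself runs PHP behind Nginx). The class name PerIPRateLimiter, the rate and burst values, and the example IP address are all hypothetical, and state is held in memory within a single process.

```python
import time


class PerIPRateLimiter:
    """Token bucket per client IP (illustrative sketch, not production code).

    rate  -- sustained queries per second allowed for each IP (assumed value)
    burst -- maximum number of queries an idle IP may send at once (assumed value)
    """

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock          # injectable so the limiter can be tested
        self._buckets = {}          # ip -> (tokens_left, time_of_last_check)

    def allow(self, ip):
        """Return True if this query from `ip` may proceed, False to reject it."""
        now = self.clock()
        # An IP seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[ip] = (tokens - 1.0, now)
            return True
        self._buckets[ip] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = PerIPRateLimiter(rate=2, burst=5)   # hypothetical limits
    # Eight back-to-back queries from one (documentation-range) address:
    print([limiter.allow("203.0.113.7") for _ in range(8)])
```

A rejected query could be answered with an HTTP 429 or 503 status instead of being processed. A real deployment would also need to expire idle entries and share state across worker processes, which this sketch does not attempt.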
Many thanks in advance for any feedback you can give.
Regards, Bill Fletcher