[Corpora-List] Web search by document size

Gregor Erbach gor at acm.org
Fri Mar 11 15:51:01 CET 2005

AllTheWeb used to have an option for restricting the search
according to document size, but this option appears to be
no longer available.

I assume most search engines rank results with some variant of
TF*IDF weighting, that is, term frequency TF (how many times a
term appears in a document) multiplied by the inverse of document
frequency DF (the number of documents in which the term appears),
combined with hyperlink analysis. So short documents in which
infrequent search terms appear will rank highly, but so will
longer documents in which the search terms appear many times,
which is not what you want.
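
To make the length effect concrete, here is a rough sketch of plain
TF*IDF scoring. It is only an illustration: the log-scaled IDF is one
common variant that I am assuming, the function name tf_idf and the
toy documents are made up, and no real search engine is claimed to
work exactly this way.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, all_docs):
    """Raw term count times log(N / DF); an assumed, simplified variant."""
    tf = Counter(doc_tokens)[term]
    # DF: number of documents that contain the term at least once
    df = sum(1 for d in all_docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)


# Toy collection (hypothetical): one short and one long document
# that both mention "corpus", plus one document that does not.
docs = [
    "corpus search".split(),
    ("corpus " * 5 + "search engine ranking results and other words").split(),
    "web search".split(),
]

for d in docs:
    print(len(d), round(tf_idf("corpus", d, docs), 3))
```

With raw counts and no length normalisation, the longer document that
repeats the term scores above the short one, which is the behaviour
described above.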

Your best bet is probably to download all search results
(which should not be too many if your list of words is
long enough), and then sort the results by document length.
You can use the Google Web API (http://www.google.com/apis/)
for this. It will allow you up to 1000 searches per day.
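
Once the result pages are on disk or in memory, the filtering and
sorting step could look roughly like the sketch below. It assumes the
pages have already been fetched into a dict of URL to text; the
fetching itself (through the API or otherwise) is left out, and the
function name smallest_documents and the example.org URLs are made up
for illustration.

```python
import re


def smallest_documents(pages, words):
    """Return (length, url) pairs for pages containing every word, shortest first.

    pages: dict mapping URL -> already-downloaded page text (assumed input).
    words: the list of words that must all occur in a page.
    """
    wanted = {w.lower() for w in words}
    hits = []
    for url, text in pages.items():
        # Crude tokenisation: lowercase runs of word characters
        tokens = set(re.findall(r"\w+", text.lower()))
        if wanted <= tokens:
            hits.append((len(text), url))
    return sorted(hits)


# Hypothetical downloaded pages
pages = {
    "http://example.org/a": "corpus linguistics and web search",
    "http://example.org/b": "A much longer page about corpus linguistics, "
                            "web search, and many other topics.",
    "http://example.org/c": "web search only",
}

for length, url in smallest_documents(pages, ["corpus", "search"]):
    print(length, url)
```

Document length is measured here in characters of the downloaded
text; counting tokens or bytes instead would be an equally reasonable
choice.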


Gregor Erbach

Brett Reynolds wrote:

> I'd like to be able to search the web for the smallest document
> containing all of a certain list of words. Is anyone aware of a search
> engine that will allow this kind of query?


> -----------------------
> Brett Reynolds
> English Language Centre
> Humber Institute of Technology and Advanced Learning
> Toronto, Ontario, Canada
> brett.reynolds at humber.ca



Dr. Gregor Erbach http://purl.org/net/gregor/
DFKI GmbH, Language Technology Lab http://www.dfki.de/
Tel. +49 (681) 302-5354 mailto:erbach at dfki.de
