[Corpora-List] Google "region"-based searches

Adam Kilgarriff adam at lexmasterclass.com
Wed Nov 28 11:15:21 CET 2012


Googleology is bad science. Being at the mercy of every slight change in Google's unpublished, undocumented search syntax, or in how Google chooses to interpret it, is horrible. We need to move to more robust approaches that depend less on Google. If you have a web-scale corpus on your own machine, you don't need Google. We have recently encoded English Clueweb (70 billion words) in the Sketch Engine - see the LREC 2012 paper <http://www.lrec-conf.org/proceedings/lrec2012/pdf/1047_Paper.pdf>.

(Work supported by the EU PRESEMT Project.) Others can obtain the same data from Carnegie Mellon and use our procedures and scripts to build this dataset for themselves. Access to our version is also a possibility.
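
To make the contrast with the Google counts quoted below concrete, here is a minimal sketch of exact Boolean counting over a corpus held locally. It is not the Sketch Engine's implementation or query language; the function names (tokenize, contains_phrase, count_hits) and the four toy documents are invented for illustration. The point is only that with strict AND semantics, adding a required word can never increase the hit count.

```python
import re

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))

def count_hits(docs, phrase, required_words=()):
    """Count documents containing the exact phrase AND every required word."""
    phrase_tokens = tokenize(phrase)
    required = [w.lower() for w in required_words]
    hits = 0
    for doc in docs:
        tokens = tokenize(doc)
        if contains_phrase(tokens, phrase_tokens) and all(w in tokens for w in required):
            hits += 1
    return hits

# Toy stand-in for a locally stored corpus (hypothetical data).
DOCS = [
    "An enterprise integration pattern for SQL and Java back ends.",
    "Each enterprise integration pattern here is written in Java.",
    "Enterprise integration pattern catalogue.",
    "A Java and SQL tutorial about integration.",
]

PHRASE = "enterprise integration pattern"
phrase_only = count_hits(DOCS, PHRASE)
with_sql = count_hits(DOCS, PHRASE, ["sql"])
with_java = count_hits(DOCS, PHRASE, ["java"])
with_both = count_hits(DOCS, PHRASE, ["java", "sql"])

# Each added term filters the previous result set, so counts only go down.
print(phrase_only, with_sql, with_java, with_both)  # prints: 3 1 2 1
```

On a real web-scale corpus one would use an inverted index rather than a linear scan, but the counts would be just as reproducible.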

Adam

On 28 November 2012 09:49, Tristan Miller <miller at ukp.informatik.tu-darmstadt.de> wrote:


> Greetings.
>
> On 28/11/12 12:00 AM, John F Sowa wrote:
> > In ancient times (pre 21st century), Google supported Boolean
> > expressions for searching. But now it's impossible to control
> > their search in any predictable fashion.
> >
> > For example, I wanted to count the number of web pages that used
> > the phrase "enterprise integration pattern" and the word 'sql'.
> >
> > But when I type just "enterprise integration pattern" by itself,
> > I get 114,000 hits. When I add another word, the number should
> > decrease. But the following combination gets 137,000 hits:
> >
> > "enterprise integration pattern" sql
> >
> > The following combination gets 274,000 hits:
> >
> > "enterprise integration pattern" java
> >
> > And the following gets 25,900,000 hits:
> >
> > "enterprise integration pattern" java sql
> >
> > I get the same numbers with a one-line search or with
> > their so-called advanced search.
> >
> > Does anybody know how to bypass the Google heuristics and
> > force it to use a simple regular expression for searching?
>
> Google used to support a "+" modifier for search terms; this instructed
> the search to return only those pages which include the search terms.
> (Without the modifier, Google was free to disregard the search terms at
> its discretion.) The "+" modifier was dropped, probably for marketing
> reasons, once Google+ was introduced. Supposedly you can now achieve
> the same effect by putting the "required" terms in quotation marks, and
> in my experience this works most of the time. For your examples, it
> appears that sometimes it does and sometimes it doesn't:
>
> "enterprise integration pattern"
>
> gets 117,000 hits, but oddly both
>
> "enterprise integration pattern" sql
>
> and
>
> "enterprise integration pattern" "sql"
>
> get 137,000 results. On the other hand,
>
> "enterprise integration pattern" java sql
>
> gets 25,800,000 results, but
>
> "enterprise integration pattern" "java" "sql"
>
> returns a more sensible 8520 results.
>
> Regards,
> Tristan
>
> --
> Tristan Miller, Doctoral Researcher
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>

--
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>
*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================


