corpus research Corpora-archive digest, Vol 1 #134 - 5 msgs

Davis, Boyd BDavis at email.uncc.edu
Fri May 27 15:37:00 CEST 2005


In addition to using Veronis' discussion at the blogspot, I'm finding that a search on Google, WebCorp, and Yahoo for a specific phrase delimited by quotation marks elicits roughly 2 1/2 times as many hits from Google (and they are correct ones) as from WebCorp or Yahoo, with about a 60% overlap of citations across all three.
Boyd Davis
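
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of how the two figures (hit ratio and overlap) might be computed. The URL sets are invented for illustration, and "overlap" is assumed here to mean the share of all distinct citations that appear in every engine's results; the message above does not say which definition was used.

```python
def overlap_share(*result_sets):
    """Share of all distinct citations that appear in every result set.

    Assumption: overlap = |intersection| / |union|. Other definitions
    (e.g. relative to the smallest set) would give different numbers.
    """
    sets = [set(s) for s in result_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set.intersection(*sets)
    return len(common) / len(union)


def hit_ratio(hits_a, hits_b):
    """How many times more hits engine A returns than engine B."""
    if hits_b == 0:
        raise ValueError("hits_b must be non-zero")
    return hits_a / hits_b


# Hypothetical citation sets for one quoted phrase (not real data).
google = {"u1", "u2", "u3", "u4", "u5"}
webcorp = {"u1", "u2", "u3"}
yahoo = {"u1", "u2", "u3", "u6"}

print(overlap_share(google, webcorp, yahoo))
print(hit_ratio(len(google), len(webcorp)))
```

Comparing sets of citations rather than the reported hit totals keeps the check independent of whatever counts each engine displays.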

________________________________

From: corpora-archive-admin at uib.no on behalf of corpora-archive-request at uib.no
Sent: Fri 5/27/2005 9:01 AM
To: corpora-archive at uib.no
Subject: Corpora-archive digest, Vol 1 #134 - 5 msgs




Today's Topics:

1. [Corpora-List] Query on the use of Google for corpus research (Peter K Tan)
2. [Corpora-List] Query on the use of Google for corpus research (Jean Veronis)
3. [Corpora-List] Two Postdocs in Comp Ling (Groningen) (Bouma G.)
4. Constitution (Michel Généreux)
5. [Corpora-List] Query on the use of Google for corpus research (Chris Jordan)

--__--__--

Message: 1
Date: Fri, 27 May 2005 14:14:30 +0800
To: corpora_AT_uib.no
From: Peter K Tan <PeterTan_AT_leonis.nus.edu.sg>
Subject: [Corpora-List] Query on the use of Google for corpus research
Cc: ellmml_AT_nus.edu.sg
Reply-To: corpora-archive_AT_uib.no

Just forwarding a question from a colleague. Would be grateful for
comments.

Cheers,
Peter

> From: Michelle Maria Lazar
> Sent: 27 May 2005 11.27
> To: Peter K W Tan; Talib, I S; Vincent Ooi; Wee Hock Ann, Lionel
> Subject: Query on the use of Google for corpus research
>
> Hi all,
>
> Someone has written to ask me whether there's any foreseeable
> problem/objection in using Google to gather statistical evidence on
> particular language usage, using key word searches. It involves a
> submission of an article currently under review. Does anyone have any
> experience/insight on this?
>
> Cheers,
>
> Michelle





--__--__--

Message: 2
Date: Fri, 27 May 2005 09:04:37 +0200
From: Jean Veronis <Jean.Veronis_AT_up.univ-mrs.fr>
To: Peter K Tan <PeterTan_AT_leonis.nus.edu.sg>
Cc: corpora_AT_uib.no, ellmml_AT_nus.edu.sg
Subject: [Corpora-List] Query on the use of Google for corpus research
Reply-To: corpora-archive_AT_uib.no

Peter K Tan wrote:

> Does anyone have any experience/insight on this?

Well... yes! I made a series of in-depth analyses of Google counts. They
are totally bogus, and unusable for any kind of serious research.

There is a summary here:

http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

I applied the same criteria to Yahoo, and it seems that their results
are (if not sincere) at least credible and usable for research (I mean
that if they are inflated for marketing reasons, which is always
possible, it seems to be in a proportional way).
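
To make "in a proportional way" concrete, here is a rough sketch of one possible check. It is not the set of criteria used in the analysis linked above; it only assumes you have reference frequencies for the same terms from a corpus you trust. All counts, names, and the tolerance value are hypothetical.

```python
from statistics import mean, pstdev


def inflation_factors(reported, reference):
    """Ratio of engine-reported count to reference count, per term."""
    return {
        term: reported[term] / reference[term]
        for term in reported
        if reference.get(term)  # skip terms with no (or zero) reference count
    }


def is_roughly_proportional(factors, tolerance=0.1):
    """True if the factors vary little relative to their mean.

    Uses the coefficient of variation; the 0.1 tolerance is an
    arbitrary illustrative threshold, not an established one.
    """
    values = list(factors.values())
    if len(values) < 2:
        return True
    avg = mean(values)
    if avg == 0:
        return True
    return pstdev(values) / avg <= tolerance


# Hypothetical counts: every term is inflated by the same factor of 2.
reported = {"corpus": 200, "lexicon": 400, "treebank": 100}
reference = {"corpus": 100, "lexicon": 200, "treebank": 50}
print(is_roughly_proportional(inflation_factors(reported, reference)))
```

If the factors are roughly constant, relative frequencies survive the inflation even though the absolute counts do not.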


--
Jean Véronis
http://www.up.univ-mrs.fr/veronis
http://aixtal.blogspot.com








--__--__--

Message: 3
Date: Fri, 27 May 2005 09:51:53 +0200
From: "Bouma G." <gosse_AT_let.rug.nl>
To: corpora_AT_uib.no
Cc: nerbonne_AT_let.rug.nl
Subject: [Corpora-List] Two Postdocs in Comp Ling (Groningen)
Reply-To: corpora-archive_AT_uib.no

Two Postdocs in Computational Linguistics

Faculty of Arts/Center for Language and Cognition Groningen (CLCG)
announces the following two postdoc positions.

* one two-year postdoc in co-reference resolution (vacancy number
205138).
* one three-year postdoc in information retrieval on handwritten
documents (vacancy number 205150)


Co-reference Resolution (vacancy 205138)

Coreference resolution is a key ingredient for the automatic
interpretation of text. Practical applications, such as Information
Extraction (IE), summarization and Question Answering (QA), require
accurate identification of coreference relations between noun phrases in
general. We hope to develop a robust system for assigning such relations
automatically in Dutch text, and we will investigate the effect of
making coreference relations explicit on the accuracy of systems for IE
and QA. We will annotate a limited amount of application-specific corpus
material, which is required for the evaluation of the coreference
resolution system in the context of IE and QA.

The project is part of the Stevin-initiative of the Dutch and Flemish
government, and will be carried out in collaboration with the Language
Technology Group of the University of Antwerp and Language and Computing
NV.

A full project description is available here
<http://www.let.rug.nl/~gosse/Corea>. Dr. G.Bouma
<http://www.let.rug.nl/~gosse/> (gosse at let.rug.nl) will coordinate
the Corea project. He may be contacted for further information.

IR for Handwritten Documents (vacancy 205150)

This postdoc will work in a team with a PhD researcher (Artificial
Intelligence, Groningen) and a scientific programmer (National Archives,
The Hague) to develop a system (Scratch) supporting the search for
information in handwritten documents. A more complete description of the
project is available here
<http://www.ai.rug.nl/alice/nwo-catch-scratch/> (in Dutch).

The goal of the project is to apply methods of stochastic grammar
modeling to support free text retrieval methods in a test bed of scans
of handwritten material. This researcher will work with the PhD student
in pattern recognition who is expected to apply image processing and
classification techniques to the scans of handwritten material. The
postdoc will also work with grammar engineering specialists to develop
software to support the search process.

The ideal candidate will therefore work well independently as well as in
a team, and is excited about applying computational linguistics
techniques to a promising new application. He or she will command good
programming skills, good English and a willingness to learn Dutch.

John Nerbonne <http://www.let.rug.nl/~nerbonne/> (nerbonne at
let.rug.nl) and Gertjan van Noord <http://www.let.rug.nl/~vannoord/>
(vannoord at let.rug.nl) will supervise the postdoc in this position.
Lambert Schomaker <http://www.ai.rug.nl/~lambert/> (schomaker at
ai.rug.nl) is the head of the project involving this postdoc, the PhD
student in pattern recognition, and the scientific programmer.


Your profile (both positions)

* a PhD in Computational Linguistics, Artificial Intelligence,
Computer science or in a related field
* research experience including relevant publications
* able to work independently
* willingness to learn Dutch (in the case of the co-reference
resolution project, a passive knowledge is minimally required).


Salary, etc.

The Rijksuniversiteit Groningen offers a salary dependent on
qualifications and work experience up to a maximum of 3453 euro gross
per month for a full-time position.


The Center for Language and Cognition Groningen

The Center for Language and Cognition Groningen (CLCG)
<http://www.rug.nl/let/onderzoek/onderzoekinstituten/clcg/> is a
research institute within the Faculty of Arts <http://www.rug.nl/let/>
of the University of Groningen. <http://www.rug.nl/> It embraces all the
linguistic research in the faculty. A considerable number of the
researchers participate in the Center for Behavioral and Cognitive
Neurosciences (BCN) <http://www.rug.nl/bcn/>, and in the Landelijke
Onderzoekschool Taalwetenschap (LOT) <http://wwwlot.let.uu.nl/>. Within
the CLCG there are six research groups: Syntax/Semantics, Discourse and
Communication, Language Variation and Change, Computational Linguistics,
Neurolinguistics, and Language and Literacy Development over the Life
Span. There are graduate students
<http://www.rug.nl/let/onderzoek/onderzoekinstituten/clcg/grad> and
postdocs working in all of these groups.


Procedure

Please include in your application letter remarks about how this
position fits in your career goals and how you hope to contribute
scientifically. In addition, we would like to receive:

* your curriculum vitae
* a copy of your diploma together with a list of grades
* a list of publications
* the names and email addresses of two references

If you are interested in applying, please send the materials above
before June 30, 2005, to:

Rijksuniversiteit Groningen,
Afdeling Personeel & Organisatie
Postbus 72
9700 AB Groningen
The Netherlands

or by email to vmp at bureau.rug.nl

Please identify the vacancy number on the envelope and in your letter.

--
Gosse Bouma, Informatiekunde, RUG, Postbus 716, 9700 AS Groningen
gosse_AT_let.rug.nl tel. +31-50-3635937 fax +31-50-3636855




--__--__--

Message: 4
Date: Mon, 23 May 2005 14:43:24 +0100
From: Michel Généreux
<michel.genereux_AT_itri.brighton.ac.uk>
To: CORPORA_AT_UIB.NO
Subject: Constitution
Reply-To: corpora-archive_AT_uib.no

Bart Defrancq wrote:

> Dear Jean,
>
>> Well, why is the term "official languages" not included in the
>> Constitution then (I thought that it was intended to be a recap of
>> all the important concepts of the EU)? I would have felt better.
>
> I don't know of many constitutions which do mention the official
> languages of the country: the Spanish one does, I know, and the French,
> but only recently. The US's does not. Even the Belgian does not (!):

Canada does: http://laws.justice.gc.ca/en/const/annex_e.html#I

"(1) English and French are the official languages of Canada and have
equality of status and equal rights and privileges as to their use in
all institutions of the Parliament and government of Canada."

Michel G.






--__--__--

Message: 5
Date: Fri, 27 May 2005 09:14:09 -0300
From: Chris Jordan <cjordan_AT_cs.dal.ca>
Subject: [Corpora-List] Query on the use of Google for corpus research
To: corpora_AT_uib.no
Reply-To: corpora-archive_AT_uib.no

Hello,

I would recommend looking at the following reference, as it is closely
related:
Craig Silverstein, Monika Henzinger, Hannes Marais, and Michael Moricz.
Analysis of a Very Large AltaVista Query Log. Technical Note 1998-014,
Digital SRC, 1998.
http://gatekeeper.dec.com/pub/DEC/SRC/technicalnotes/abstracts/src-tn-1998-014.html

There are some interesting issues with regard to examining such data.
The first that comes to mind is that you have to be able to
distinguish between search sessions. This is non-trivial, as users
typically do not have a single goal when searching; there is some work
by Spink on this topic. Gathering this query data at the client side
and gathering it at the server side each have their own set of problems.
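
As a rough illustration of the session problem, the sketch below splits a server-side query log into sessions using an inactivity gap. The 30-minute threshold, the log layout (user id, timestamp, query), and the sample records are all assumptions made for this example; they are not taken from the report cited above.

```python
from datetime import datetime, timedelta


def split_sessions(log, gap=timedelta(minutes=30)):
    """Group each user's queries into sessions separated by inactivity.

    log: iterable of (user_id, timestamp, query) tuples.
    Returns {user_id: [[query, ...], ...]} with sessions in time order.
    """
    sessions = {}
    last_seen = {}
    for user, ts, query in sorted(log, key=lambda r: (r[0], r[1])):
        # Start a new session for a new user or after a long pause.
        if user not in last_seen or ts - last_seen[user] > gap:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(query)
        last_seen[user] = ts
    return sessions


# Hypothetical log records (not real data).
log = [
    ("u1", datetime(2005, 5, 27, 9, 0), "corpus tools"),
    ("u2", datetime(2005, 5, 27, 9, 5), "dutch parser"),
    ("u1", datetime(2005, 5, 27, 9, 10), "corpus concordancer"),
    ("u1", datetime(2005, 5, 27, 10, 0), "cheap flights"),
]
print(split_sessions(log))
```

A time gap alone cannot separate two goals pursued within the same sitting, which is exactly why the problem is non-trivial.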

As statistics are being gathered, it is important to discuss the
properties of the user group (sample population) being evaluated. The
diversity of the sample (or lack of it) will determine what kind of
conclusions can be drawn.

Hope that helps,

Chris

Peter K Tan wrote:

> Just forwarding a question from a colleague. Would be grateful for
> comments.
>
> Cheers,
> Peter
>
> From: Michelle Maria Lazar
> Sent: 27 May 2005 11.27
> To: Peter K W Tan; Talib, I S; Vincent Ooi; Wee Hock Ann, Lionel
> Subject: Query on the use of Google for corpus research
>
> Hi all,
>
> Someone has written to ask me whether there's any foreseeable
> problem/objection in using Google to gather statistical evidence
> on particular language usage, using key word searches. It involves
> a submission of an article currently under review. Does anyone
> have any experience/insight on this?
>
> Cheers,
>
> Michelle






__--__--


_______________________________________________
Corpora-archive mailing list
Corpora-archive at uib.no
http://mailman.uib.no/listinfo/corpora-archive


End of Corpora-archive Digest

