[Corpora-List] corpus of plain text docs in English

JCorbett at umac.mo JCorbett at umac.mo
Tue Apr 5 12:13:20 CEST 2011


You could try the Corpus of Modern Scottish Writing (1700-1945), which has a range of text types going back to the 18th century. At the moment the texts can only be downloaded one by one, so you could work on a subcorpus to start with, but a bulk download should be made available in the not too distant future. See http://www.scottishcorpus.ac.uk/cmsw/ where you can view digital facsimiles, transcriptions and plain text, and also download plain text files.

Hope this helps,

John Corbett

From: corpora-request at uib.no
To: corpora at uib.no
Date: 05/04/2011 18:02
Subject: Corpora Digest, Vol 46, Issue 6
Sent by: corpora-bounces at uib.no

Today's Topics:

1. corpus of plain text docs in English (petar at lml.bas.bg)

2. Re: corpus of plain text docs in English (Mark Davies)

3. Call for Papers: "Language Technology for a Multilingual Europe" (David Vilar)

4. CFP SIGIR 2011 Workshop on "entertain me": Supporting Complex Search Tasks (Jaap Kamps)

----------------------------------------------------------------------

Message: 1
Date: Fri, 1 Apr 2011 10:13:28 +0300
From: petar at lml.bas.bg
Subject: [Corpora-List] corpus of plain text docs in English
To: Corpora at uib.no

Dear Corpora members,

I am working on a domain-specific machine translation project. I am looking for a corpus of plain text (historical) documents in English. I would like to test whether a standard n-gram model, trained on such documents, could be used to improve other machine translation techniques designed specifically for historical documents. Could you recommend some corpora?

Thank you.

Best regards, Petar Mitankin

------------------------------

Message: 2
Date: Mon, 4 Apr 2011 08:43:17 -0600
From: Mark Davies <Mark_Davies at byu.edu>
Subject: Re: [Corpora-List] corpus of plain text docs in English
To: "petar at lml.bas.bg" <petar at lml.bas.bg>, "Corpora at uib.no" <Corpora at uib.no>

Petar,

I'm not sure how far back you want the texts. If it's just to the early 1800s or so, you might check the links at the 400 million word Corpus of Historical American English (http://corpus.byu.edu/coha): Help / Composition of Corpus. It provides suggestions for some nice text archives, like Project Gutenberg, Making of America, etc.

For anything farther back than the early 1800s, you could just use the older texts from Project Gutenberg, or the many online archives of authors of Early Modern English. If your library is a member, you'll also want to check the huge collection at Early English Books Online (EEBO) for the machine readable (as opposed to the PDF image) texts.

Best,

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================



------------------------------

Message: 3
Date: Tue, 05 Apr 2011 10:52:13 +0200
From: David Vilar <david.vilar at dfki.de>
Subject: [Corpora-List] Call for Papers: "Language Technology for a Multilingual Europe"
To: CORPORA at UIB.NO

PDF Version with complete information: http://www.dfki.de/~davi01/cfp/ws-cfp.en.pdf

Apologies if you receive multiple copies of this call.

Call for Papers: "Language Technology for a Multilingual Europe"
================================================================

Overview
--------

The workshop aims to bring together various groups concerned with the broad topic of "Language Technology for a Multilingual Europe". These include, on the one hand, representatives from research and development in the field of language technologies and, on the other hand, users from quite diverse areas. Two examples of the application of language technology are (automatic / machine) translation and the processing of texts from the humanities with methods from language technology, such as automatic topic indexing, text mining, and the integration of numerous texts and additional information across languages.

These kinds of application areas and research and development in language technology have in common that they rely on resources (lexica, corpora, grammars, ontologies etc.), or that they produce these resources. A multilingual Europe, being supported by language technology, is only possible if an adequate, interoperable infrastructure of resources, including the related tooling, is available for all European languages.

In addition, it is necessary that the aforementioned and other communities of developers and users of language technology stand as one homogeneous community. Only in this way will it be possible to ensure the long-term political acceptance of the topic "language technology" in Europe.

Topics
------

The workshop aims to bring together research and development from academia and industry to discuss the aforementioned technical and political prerequisites for language technology in Europe. Submissions may touch on the following or other aspects of this overall topic:

- Research and development of language technology in various areas (Human Language Technology, ICT, eHumanities, ...)
- Infrastructure for resources in language technology
- Prerequisites for interoperability of language technology based applications
- Language technology and standardization
- "Political perspectives" about requirements and the usefulness of language technology, from the perspective of research, industry and various user communities.

Important dates
---------------

Deadline for submission of abstracts: May 15th 2011
Notification of acceptance: June 15th 2011
Workshop: September 27th, the Tuesday before the GSCL conference

--
David Vilar Torres
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. (+49) 30 238 95 1845

--------------- Legal Note ---------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender), Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313

------------------------------

Message: 4
Date: Tue, 05 Apr 2011 11:11:26 +0200
From: Jaap Kamps <kamps at science.uva.nl>
Subject: [Corpora-List] CFP SIGIR 2011 Workshop on "entertain me": Supporting Complex Search Tasks
To: corpora at uib.no

SIGIR 2011 Workshop on "entertain me": Supporting Complex Search Tasks
July 28, Beijing
http://staff.science.uva.nl/~kamps/entertainme/

Call for Papers: deadline June 3

* A Workshop on a Single Query ?!?

Searchers with a complex information need typically slice-and-dice their problem into several queries and subqueries, and laboriously combine the answers post hoc to solve their tasks. This workshop invites discussion about any technique, knowledge representation, model or technology to integrate the search results into a coherent session on a level of abstraction which matches the original information need.

Consider planning a social event on the last day of SIGIR, in the unfamiliar city of Beijing, factoring in distances, timing, and preferences on budget, cuisine, and entertainment. A system supporting the entire search episode should "know" a lot, either from profiles or implicit information, or from explicit information in the query or from feedback.

This may lead to the (interactive) construction of a complexly structured query, but sometimes the most obvious query for a complex need is dead simple: "entertain me." Rather than returning ten-blue-lines in response to a 2.4-word query, the desired system should support searchers during their whole task or search episode, by iteratively constructing a complex query or search strategy, by exploring the result-space at every stage, and by combining the partial answers into a coherent whole.

Although a SIGIR Workshop devoted to a single query may seem extravagant, this query is just one example of the general problem of supporting simple and common requests that express complex and dynamic needs.

* Social Evening Program

Many interesting ideas will come out of the workshop, but how do we know if they are any good? We will have a special breakout group designing a mock-up for solving the "entertain me" query, charting out the background information (implicit and explicit context), the different sources (maps, web, social, news, ...), and the needed components and interaction. A group of local Peking University grad students is available to serve as oracles for local information.

The scientific evaluation of the resulting "entertainment plan" will be done by executing it in the evening after the workshop, with all participants.

- Are you willing and able to sponsor the social event? Please contact the organizers for details.
- Do you want to take part? Read the Call for Submissions and contribute!

* Call for Submissions

We invite the submission of papers that think outside the box, from any aspect of relevance to the workshop's theme, including:

- information seeking behavior, interaction, berry-picking;
- information needs and ways of articulating them;
- implicit and explicit feedback;
- exploiting collection structure and semantic annotations;
- exploratory search, HCI, UI and UX design;
- situated search (maps, Geo, customized, personalized, ...);
- entertainment search (broadcasters, content owners, network operators, device manufacturers).

We aim to bring together a varied group of researchers covering both user and system centered approaches, and together work on ways to make IR systems support searchers when interactively solving a complex task, such as the entertain me planning problem.

Help us shape the future of IR!

- Submit a short 2-page poster or position paper of relevance to supporting complex tasks, e.g., one that identifies specific research problems and use-cases, develops models/theory of complex tasks and interaction, discusses novel interfaces or system components, examines ways of evaluating, and/or reports on preliminary experiments,

- and take an active part in the discussion at the Workshop.

The deadline is Monday June 3, 2011; submission details and further information are at http://staff.science.uva.nl/~kamps/entertainme/

Nick Belkin (Rutgers)
Charlie Clarke (Waterloo)
Ning Gao (Peking University)
Jaap Kamps (Amsterdam)
Jussi Karlgren (SICS)




