[Corpora-List] Query about corpora of spoken English

Briony Williams b.williams at bangor.ac.uk
Fri Dec 2 17:17:01 CET 2005

R.M.Salkie at bton.ac.uk wrote:

> My colleague Nicolas Ballier (nicolas.ballier at lli.univ-paris13.fr)
> has asked me to post the following two queries. Please reply directly
> to him.

It may be useful to others to have the replies in a public forum like this
one - so here is a quick reply to the CORPORA list.

> 1. Is there a web page which lists currently available corpora of
> spoken English (e.g. MARSEC, MAchine REadable Spoken ENglish Corpus),
> stating whether the sound files are available?

You could try the catalogue pages of:

a) Linguistic Data Consortium (subset "speech")

b) Evaluations and Language Resources Distribution Agency

c) International Computer Archive of Modern and Medieval English

d) The MARSEC corpus

> 2. Is there software available to align texts and sound files: for
> example, software that enables the user to listen to any part of the
> document by clicking on a word in the text?

First, the sound file needs to be aligned with the linguistic annotation.
Some popular applications currently used for doing this manually are listed
below (other applications exist for automatic segmentation of speech files).
With any of them, you can click on an individual word and listen to it once
a word-level segmentation has been carried out.

a) Praat (has a very flexible scripting language)

b) Emu (segment-level and also higher linguistic levels, plus hierarchical
structure; has some scripting capability for automatic building of trees)

c) Transcriber ("It provides a user-friendly graphical user interface for
segmenting long duration speech recordings, transcribing them, and labeling
speech turns, topic changes and acoustic conditions. It is more specifically
designed for the annotation of broadcast news recordings, for creating
corpora used in the development of automatic broadcast news transcription
systems, but its features might be found useful in other areas of speech

d) MATE workbench ("a program designed to aid in the display, editing and
querying of annotated speech corpora")

These are by no means the only tools available (I have omitted xlabel, as it
is no longer supported).
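
To make concrete what "click on a word and listen" relies on once a
word-level segmentation exists, here is a minimal Python sketch. It is not
how any of the tools above work internally. The segmentation is assumed to
be a sorted, non-overlapping list of (start, end, label) intervals in
seconds, and the names Interval, interval_at and clip_frames are
hypothetical, chosen only for illustration. Only the standard library is
used.

```python
import bisect
import wave
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    """One entry of a word-level segmentation (hypothetical format)."""
    start: float  # seconds from the beginning of the sound file
    end: float    # seconds from the beginning of the sound file
    label: str    # the word


def interval_at(tier: List[Interval], t: float) -> Optional[Interval]:
    """Return the interval containing time t, or None if t falls in a gap.

    Assumes the tier is sorted by start time and does not overlap.
    """
    starts = [iv.start for iv in tier]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and tier[i].start <= t < tier[i].end:
        return tier[i]
    return None


def clip_frames(wav_file, iv: Interval) -> bytes:
    """Return the raw PCM frames of a WAV file that fall inside iv.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as w:
        rate = w.getframerate()
        first = int(round(iv.start * rate))
        count = int(round(iv.end * rate)) - first
        w.setpos(first)              # seek to the first frame of the word
        return w.readframes(count)   # read only the frames for that word
```

A viewer would map the clicked position in the text to an Interval, then
pass the bytes returned by clip_frames to whatever audio playback facility
is available on the platform.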

Best regards

Briony Williams
