[Corpora-List] C-unit tagging

chris brew cbrew at acm.org
Thu Feb 21 19:57:56 CET 2008


All the sentence segmentation tools that I am aware of (for example David Palmer's SATZ) tag sentence boundaries by looking at a fairly wide range of features of the text, some of which are really matters of how newspapers happen to be laid out and would not transfer directly to a spoken corpus. So I think you are probably not going to find an off-the-shelf tool.

In practice, the best next step is to find a friend who is good with Python, Perl, Ruby or another text processing language that handles regular expressions well. Force your friend to sit down with you and take a very detailed look at precisely how your corpus is transcribed, then devise a regular expression that catches most of the boundaries you want. The result will probably be closely tied to the specifics of your corpus, and will probably not be perfect, but it will be a start.
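
To give a rough idea of what such a regular expression might look like, here is a minimal Python sketch. Everything about the transcription is an assumption made up for illustration: I pretend that pauses are written as "(.)" or as a timed pause like "(1.5)", and that a pause followed by one of a short, hand-picked list of words (and, but, so, well, because) is a candidate unit boundary. The function name split_units and the word list are hypothetical; your corpus will need its own pattern, and this heuristic is not a definition of a C-unit.

```python
import re

# ASSUMPTION: pauses are transcribed as "(.)" or "(1.5)"; this is an
# invented convention, not taken from any particular corpus.
# ASSUMPTION: the word list below is a hand-picked guess at words that
# often open a new unit; it would need tuning against real data.
BOUNDARY = re.compile(
    r"\s*\((?:\.|\d+(?:\.\d+)?)\)\s*"      # a pause marker such as (.) or (1.5)
    r"(?=(?:and|but|so|well|because)\b)",  # followed by a likely unit opener
    re.IGNORECASE,
)


def split_units(turn):
    """Split one speaker turn at candidate unit boundaries.

    Pauses that are not followed by one of the opener words are left
    in place, since they may fall in the middle of a unit.
    """
    return [unit.strip() for unit in BOUNDARY.split(turn) if unit.strip()]


if __name__ == "__main__":
    turn = "i went to the shop (.) and it was closed (1.5) so i came home"
    for unit in split_units(turn):
        print(unit)
```

Run on a few hundred turns and checked by hand, something like this shows fairly quickly which boundaries the pattern misses and which it invents, and that is what tells you how to revise it.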

On 21/02/2008, Su Qi Apple <applesuqi at yahoo.co.uk> wrote:
>
> Dear All
>
> I am just beginning my study in corpus linguistics and in a corpus of
> spoken English in particular. I want to ask if someone can tell me if you
> know of any tagging programs that can indicate C-units as opposed to
> sentences.
>
> I look forward to your replies.
>
> Apple Su Qi
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>