[Corpora-List] syllable contact frequency - CELEX

Caren Brinckmann caren at brinckmann.de
Wed Oct 22 14:51:54 CEST 2008

Dear Katharina,

If you take the CELEX file gpl.cd ("German phonology lemma"), you will find the transcription in DISC format in the fourth column. You can then use the following UN*X pipeline to extract, e.g., the number of lemmas containing "p" followed by "t" where the two are separated by a syllable boundary (and possibly an accent marker):

cut -d"\\" -f4 gpl.cd | grep "p-'*t" | wc -l

(The result should be 181.)

The first part of the pipeline (cut -d"\\" -f4 gpl.cd) extracts the fourth column of the file gpl.cd, using the backslash as the column separator. The second part (grep "p-'*t") keeps only the lines of that column that match a regular expression: "p", then the syllable boundary "-", then zero or more accent markers "'", then "t". The last part (wc -l) counts the remaining lines, i.e. the lemmas that match the regular expression. Simply change the second part to suit your search.
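
If you need counts for several consonant pairs, the same pipeline can be wrapped in a small shell function. The following is only a sketch: the function name count_contact and the file toy_gpl.cd are made up for illustration, and the four toy lines are placeholders in the same backslash-separated layout, not real CELEX entries. It uses grep -c, which counts matching lines directly and so replaces the "grep ... | wc -l" step.

```shell
# Toy stand-in for gpl.cd: four invented lines, backslash-separated,
# with a DISC-like transcription in the fourth column (NOT real CELEX data).
cat > toy_gpl.cd <<'EOF'
1\Haupt\10\'hB6pt
2\abtun\2\'ap-tun
3\Rezeptur\1\re-tsEp-'tur
4\Abt\3\'apt
EOF

# count_contact FIRST SECOND FILE (hypothetical helper)
# Prints the number of lines whose fourth column contains FIRST,
# a syllable boundary "-", an optional accent marker "'", then SECOND.
count_contact() {
    cut -d'\' -f4 "$3" | grep -c -- "$1-'*$2"
}

# Only lines 2 and 3 have p and t across a syllable boundary.
count_contact p t toy_gpl.cd
```

On the real file you would call it as "count_contact p t gpl.cd". Note that the two segments are pasted into the regular expression unchanged, so a segment symbol that happens to be a regular expression metacharacter would have to be escaped by hand.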

Keep in mind, though, that CELEX is not a corpus but a lexicon. The numbers you get are therefore type frequencies, i.e. they tell you how many _lemmas_ listed in CELEX contain your search pattern, not how often the pattern occurs in running text.

Hope this helps,
Caren

