[Corpora-List] converting non-embedded tags into embedded ones

Stefan Th. Gries stgries at gmail.com
Sun Feb 24 18:27:10 CET 2008


Hi


> Subject: [Corpora-List] converting non-embedded tags into embedded ones
> To: Corpora_AT_uib.no
> Could someone help me with this problem:
> I have texts with non-embedded tags:
> eg: <lex pos=NN>time</lex>
> but I would like to convert them to embedded tags (if this is the right term):
> eg: time_NN
> I have tried using a macro in MS Word but can't seem to find a way to get it to do it. I do not know how to program so your expertise here would be most appreciated.

Once you have the corpora as plain text files rather than Word documents, you could do it in R (<http://www.r-project.org/>):

# a ridiculously oversimplified corpus line called x,
# which hopefully still conveys the idea
x <- "<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"

# a regular expression that does what you want
gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3", x, perl=TRUE)

# the result
"time_NN funny_JJ"

Of course, this may have to be adapted depending on what the rest of your corpus looks like. Stuff like this will be explained in my forthcoming textbook /Quantitative Corpus Linguistics with R: A Practical Introduction/; the companion website is at <http://groups.google.com/group/corpling-with-r/web/quantitative-corpus-linguistics-with-r> and the newsgroup where more such questions could also be posted is at <http://groups.google.com/group/corpling-with-r>.
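To apply the same substitution to a whole file rather than a single string, something along the following lines might work. This is only a sketch: the function name embed_tags, the two sample lines, and the temporary files are assumptions made for illustration, and the pattern assumes that pos= is the last attribute inside each opening tag.

```r
# Hypothetical helper: convert every <lex pos=TAG>word</lex> in a character
# vector of corpus lines to word_TAG, using the regular expression from above.
embed_tags <- function(lines) {
  gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3", lines, perl = TRUE)
}

# Stand-in for a real corpus file: two made-up lines in a temporary file.
infile <- tempfile(fileext = ".txt")
writeLines(c("<lex pos=DT>the</lex> <lex pos=NN>time</lex>",
             "<lex pos=JJ>funny</lex> <lex pos=NNS>things</lex>"), infile)

corpus <- readLines(infile)       # one vector element per line of the file
converted <- embed_tags(corpus)   # gsub() is vectorized over the lines

outfile <- tempfile(fileext = ".txt")
writeLines(converted, outfile)    # save the converted corpus
print(converted)
```

With a real corpus you would replace infile and outfile with your own file paths.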

# And here's an explanation of the regular expression:

Match the character "<" literally «<»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters "pos=" literally «pos=»
Match the regular expression below and capture its match into backreference number 1 «([^>]*)»
   Match any character that is NOT a ">" «[^>]*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character ">" literally «>»
Match the regular expression below and capture its match into backreference number 2 «([^<]*?)»
   Match any character that is NOT a "<" «[^<]*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters "</" literally «</»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character ">" literally «>»
Match the regular expression below and capture its match into backreference number 3 «([^<]*)»
   Match any character that is NOT a "<" «[^<]*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
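If you want to see what the three backreferences actually capture, a small check like the one below may help. It is a sketch only: the labels given to the vector elements are my own, and it assumes a version of R in which regexec() accepts perl=TRUE.

```r
x <- "<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"
pattern <- "<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)"

# regexec() locates the first match together with its captured groups;
# regmatches() then extracts them as a character vector.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# Illustrative labels: element 1 is the whole match, elements 2 to 4 are
# backreferences 1 (the tag), 2 (the word) and 3 (the text after </lex>).
names(m) <- c("whole match", "1: tag", "2: word", "3: trailing text")
print(m)
```

For the first tagged word this gives "NN", "time" and a single space, which is why the replacement "\\2_\\1\\3" yields "time_NN " before the next match is processed.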

HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------


