[Corpora-List] Fwd: converting non-embedded tags into embedded ones

Stefan Th. Gries stgries at gmail.com
Mon Feb 25 16:09:10 CET 2008


Hi

Once you have the corpora as text files rather than Word documents, you could do it in R (<http://www.r-project.org/>):

# a ridiculously oversimplified corpus line called x,
# which hopefully still conveys the idea
x <- "<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"

# a regular expression that does what you want
gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3", x, perl=TRUE)

# the result
[1] "time_NN funny_JJ"

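Since gsub() is vectorized, the same call should work on a whole file read in with readLines(). Here is a minimal sketch of that; the second corpus line and the temporary file are made up for illustration and merely stand in for your own text files:

```r
# Hypothetical corpus lines; in practice these would already be in your text file
corpus <- c("<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>",
            "<lex pos=DT>the</lex> <lex pos=NN>corpus</lex>")

# Write them to a temporary file so that the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(corpus, tmp)

# Read the file line by line and convert every line in one go
lines <- readLines(tmp)
converted <- gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3",
                  lines, perl = TRUE)
print(converted)
# [1] "time_NN funny_JJ" "the_DT corpus_NN"
```

The converted lines could then be saved again with writeLines(converted, "some_output_file.txt"), where the file name is of course up to you.
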
Of course, this may have to be adapted depending on what the rest of your corpus looks like. Stuff like this will be explained in my forthcoming textbook /Quantitative Corpus Linguistics with R: A Practical Introduction/; the companion website is at <http://groups.google.com/group/corpling-with-r/web/quantitative-corpus-linguistics-with-r> and the newsgroup where more such questions could also be posted is at <http://groups.google.com/group/corpling-with-r>.
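
To give an idea of what such an adaptation might look like, here is a sketch for a purely hypothetical annotation format in which the attribute values are quoted and the tags carry other attributes as well. The element name w and the id attribute are assumptions made up for this example. Using [^>]* instead of .*? inside the tags also keeps each match within a single tag, so a tag without pos= (such as <s>) is left alone rather than swallowed:

```r
# Hypothetical corpus line: quoted attribute values, an extra id attribute,
# and a sentence tag that has no pos attribute
y <- '<s><w id="1" pos="NN">time</w> <w id="2" pos="JJ">funny</w></s>'

# [^>]* cannot run past the end of a tag; ([^"]*) captures the tag value
# and ([^<]*) captures the word between the opening and the closing tag
adapted <- gsub('<[^>]*?pos="([^"]*)"[^>]*>([^<]*)</[^>]*>', "\\2_\\1",
                y, perl = TRUE)
print(adapted)
# [1] "<s>time_NN funny_JJ</s>"
```
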

# And here's an explanation of the regular expression:

Match the character "<" literally «<»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters "pos=" literally «pos=»
Match the regular expression below and capture its match into backreference number 1 «([^>]*)»
   Match any character that is NOT a ">" «[^>]*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character ">" literally «>»
Match the regular expression below and capture its match into backreference number 2 «([^<]*?)»
   Match any character that is NOT a "<" «[^<]*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters "</" literally «</»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character ">" literally «>»
Match the regular expression below and capture its match into backreference number 3 «([^<]*)»
   Match any character that is NOT a "<" «[^<]*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
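
If you want to see for yourself what the three backreferences capture, something like the following sketch may help. It uses regexec() and regmatches() from base R on the toy line from above, and it also contrasts a greedy with a lazy quantifier; the variable names are just for illustration:

```r
x <- "<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"
pattern <- "<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)"

# regexec() finds the first match; regmatches() then returns the full match
# followed by the contents of backreferences 1, 2, and 3
first <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
print(first)
# [1] "<lex pos=NN>time</lex> " "NN" "time" " "

# Greedy .* runs from the first "<" to the last ">", i.e. over the whole line
greedy <- sub("<.*>", "", x, perl = TRUE)
print(greedy)
# [1] ""

# Lazy .*? stops at the first ">" it can, i.e. after the first opening tag
lazy <- sub("<.*?>", "", x, perl = TRUE)
print(lazy)
# [1] "time</lex> <lex pos=JJ>funny</lex>"
```

Note that backreference 3 holds the space after the closing tag, which is why the replacement "\\2_\\1\\3" keeps the words separated.
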

HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------


