[Corpora-List] converting non-embedded tags into embedded ones

Emiliano Guevara emiliano.guevara at unibo.it
Sun Feb 24 10:57:31 CET 2008

Dear Warren,

No idea if and/or how you could do that in M$ Word or Windows... or even why you would try to do corpus linguistics with an application that is designed to write business letters...

But if you have access to a Linux/Unix box, a bit of regex and AWK would solve your problem in seconds.

Assuming that the corpus is REALLY ALWAYS like this:

<lex pos=NN>time</lex>

and that every <lex></lex> element is on a separate line:

1. open your favorite text editor (capable of doing general search and replace with regexes)

2. delete "<lex pos=" string,

delete "</lex>" string,

you're left with:

NN>time

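3. open a shell and run awk on the saved file (corpus.txt and corpus_tagged.txt are just placeholder names, use your own):

awk 'BEGIN {FS=">"; OFS = "_";}{print $2, $1}' corpus.txt > corpus_tagged.txt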

That's all.
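
If you would rather skip the text editor, steps 2 and 3 can probably be chained into a single pipeline. What follows is only a minimal sketch under the same assumptions as above (one <lex> element per line, pos values written without quotes); the function name lex_to_embedded is made up for illustration, and the two sample lines stand in for your real corpus.

```shell
# Hypothetical helper (the name is an assumption, not a standard tool):
# reads lines like <lex pos=NN>time</lex> on stdin, writes time_NN on stdout.
lex_to_embedded() {
    # Step 2: delete the '<lex pos=' and '</lex>' strings, leaving NN>time.
    # Step 3: split on '>' and print word first, tag second, joined by '_'.
    sed -e 's|<lex pos=||' -e 's|</lex>||' |
        awk 'BEGIN {FS=">"; OFS="_"} {print $2, $1}'
}

# Sample input only; replace with your own file, e.g. lex_to_embedded < yourfile
printf '%s\n' '<lex pos=NN>time</lex>' '<lex pos=VBZ>flies</lex>' | lex_to_embedded
```

Lines that do not follow the assumed pattern will come out mangled, so it is worth checking a sample of the output by eye before trusting the whole corpus.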

If you really cannot work on anything other than Windows/M$ Word, I think you could try doing step 2 in M$ Word manually: just use Find/Replace (never try doing macros.... bad thing!). Then convert the corpus, which is now in the format "NN>time", to a huge table, with ">" as the column separator. After that, grab the second column of the table and move it in front of the first. Finally, convert everything back into text format, with "_" as the separator, so that you end up with "time_NN".

good luck,


On 24 Feb 2008, at 10:15, Warren Tang wrote:

> Could someone help me with this problem:
> I have texts with non-embedded tags:
> eg: <lex pos=NN>time</lex>
> but I would like to convert them to embedded tags (if this is the
> right term):
> eg: time_NN
> I have tried using a macro in MS Word but can't seem to find a way
> to get it to do it. I do not know how to program so your expertise
> here would be most appreciated.
> Warren
