[Corpora-List] converting non-embedded tags into embedded ones

Emiliano Guevara emiliano.guevara at unibo.it
Sun Feb 24 10:57:31 CET 2008


Dear Warren,

No idea if and/or how you could do that in M$ Word or Windows... or even why you would try to do corpus linguistics with an application that is designed to write business letters...

But if you have access to a Linux/Unix box, a bit of regex and AWK would solve your problem in seconds.

Assuming that the corpus is REALLY ALWAYS like this:

<lex pos=NN>time</lex> and that every <lex></lex> element is on a separate line:

1. open your favorite text editor (capable of doing general search and replace with regexes)

2. delete "<lex pos=" string,

delete "</lex>" string,

you're left with:

"NN>time"

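3. open a shell and run the following on the edited file (input.txt and output.txt are just placeholder names, use your own):

awk 'BEGIN {FS = ">"; OFS = "_"} {print $2, $1}' input.txt > output.txt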

That's all.
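
If you would rather skip the manual search and replace, the same conversion can be sketched in a single pass with one regular expression. This is only a rough sketch in Python, under the same assumption as above (every element looks exactly like <lex pos=NN>time</lex>, with an unquoted pos value); the function name embed_tags is made up for the example.

```python
import re

# Matches one element of the form <lex pos=NN>time</lex>.
# Assumptions: the pos value is unquoted and contains no ">" or whitespace,
# and the word itself contains no "<".
LEX = re.compile(r"<lex pos=([^>\s]+)>([^<]*)</lex>")


def embed_tags(text: str) -> str:
    """Rewrite every <lex pos=TAG>word</lex> in text as word_TAG."""
    # \1 is the tag, \2 is the word, so the replacement swaps their order.
    return LEX.sub(r"\2_\1", text)


print(embed_tags("<lex pos=NN>time</lex>"))  # time_NN
```

Because the pattern matches whole elements rather than whole lines, this variant should also cope with several elements on the same line, which the awk one-liner does not.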

If you really cannot work on anything other than Windows/M$ Word, I think you could try doing steps 1 and 2 manually in M$ Word, using just Find/Replace (never try doing macros... bad thing!), and then convert the corpus, now in the format "NN>time", into a huge table with the columns divided at ">". After that, grab the second column of the table and move it in front of the first. Then convert everything back into text, using "_" as the separator between the columns.

good luck,

E.

On 24 Feb 2008, at 10:15, Warren Tang wrote:


> Could someone help me with this problem:
>
> I have texts with non-embedded tags:
>
> eg: <lex pos=NN>time</lex>
>
> but I would like to convert them to embedded tags (if this is the
> right term):
>
> eg: time_NN
>
> I have tried using a macro in MS Word but can't seem to find a way
> to get it to do it. I do not know how to program so your expertise
> here would be most appreciated.
>
>
> Warren
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dip. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia

Homepage: http://morbo.lingue.unibo.it/
E-mail: emiliano.guevara at unibo.it
        emiguevara at gmail.com
****************************************


