[Corpora-List] converting non-embedded tags into embedded ones

Oliver Mason O.Mason at bham.ac.uk
Sun Feb 24 17:16:09 CET 2008

The easiest solution (which also allows for several lex elements on the same line, provided each element *always* sits entirely on one line, i.e. no line breaks inside a lex element) is to use sed, the Unix stream editor, which doesn't even require any manual preparation:

cat your_text_file | sed 's/<lex pos=\([^>]*\)>\([^<]*\)<\/lex>/\2_\1 /g' > output_file

But then, using XML tools is probably a bit more user-friendly... and safer, as they don't rely on the exact formatting. But I feel more comfortable with sed than with XSLT :)
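
To illustrate the formatting caveat, here is a minimal sketch. The two-line input is a made-up sample (unquoted pos attribute assumed), not taken from the original data. sed processes its input one line at a time, so an element that is broken across lines is never matched and passes through unchanged:

```shell
# Hypothetical input: one lex element split over two lines.
input=$(printf '<lex pos=NN>time\n</lex>')

# Same substitution as above; neither line contains a complete element.
result=$(printf '%s\n' "$input" | sed 's/<lex pos=\([^>]*\)>\([^<]*\)<\/lex>/\2_\1 /g')

# Prints the input unchanged, tags and all.
printf '%s\n' "$result"
```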


PS What this expression does is replace each match of the pattern ('s' for substitute) with the matched sub-expressions (the bits between \(...\)) in reverse order, hence \2 ('time' in the example) followed by \1 ('NN'), joined by an underscore. The final 'g' means global, i.e. more than once per line if applicable. 'sed' can be a little daunting, but it is very powerful.
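
A quick way to see the substitution at work is to feed it a single line. The sample line below is an assumption for illustration (two elements on one line, unquoted pos values); only 'time' and 'NN' come from the example discussed here:

```shell
# Hypothetical sample line with two lex elements and unquoted pos values.
line='<lex pos=DT>the</lex><lex pos=NN>time</lex>'

# \1 captures the pos value, \2 the word; the replacement swaps them.
result=$(printf '%s\n' "$line" | sed 's/<lex pos=\([^>]*\)>\([^<]*\)<\/lex>/\2_\1 /g')

# Prints: the_DT time_NN   (each token is followed by a space)
printf '%s\n' "$result"
```

If the pos values in your file are quoted (pos="NN"), the quotes would be captured as part of \1, so the pattern would need adjusting.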
