[Corpora-List] extract the content of an html element (was: chaker jebari)
peter.adolphs at student.hu-berlin.de
Sat Dec 2 14:25:01 CET 2006
Chaker Jabbari wrote:
> I need a tool (under windows) to extract the content of any html tag
> from a html/text file.
Do you want to strip the tags or do you want to extract the content of
specific html elements?
You could either extract the content with regular expressions or, for
cleaner results, convert the HTML file to XML (tidy, jtidy) and transform
that into the desired output with XSLT. In both cases, I would recommend
jEdit, a powerful text editor that is Free Software and written in Java.
There are numerous plugins and macros available that you could use for
your task (plugins: JTidy and XSLT; macros: for instance, my own
regular-expression-based "Extract Matches").
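If you would rather script it than use an editor, here is a rough sketch
of both approaches in Python. It is only an illustration, not what the
jEdit plugins or my macro do internally. The function names are made up
for this example. The second function assumes the input has already been
converted to well-formed XML (e.g. by tidy), and it uses Python's
ElementTree as a stand-in for the XSLT step.

```python
import re
import xml.etree.ElementTree as ET


def extract_with_regex(html, tag):
    """Return the raw inner markup of every <tag>...</tag> pair.

    Quick and tolerant of messy HTML, but it does not handle an element
    nested inside another element of the same name.
    """
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return [m.group(1).strip() for m in pattern.finditer(html)]


def extract_from_xml(xhtml, tag):
    """Return the text content of every <tag> element.

    Assumes xhtml is well-formed XML without namespaces; child tags are
    dropped and only their text is kept. Element names are case-sensitive.
    """
    root = ET.fromstring(xhtml)
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]


if __name__ == "__main__":
    sample = "<html><body><p class='x'>First <b>bold</b> text</p></body></html>"
    print(extract_with_regex(sample, "p"))
    print(extract_from_xml(sample, "p"))
```

The regex version keeps any inner markup as it is, while the XML version
returns plain text only, which is usually what I mean by cleaner results.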
Hope that helps!
Peter Adolphs peter.adolphs at student.hu-berlin.de gpg/pgp welcome!