[Corpora-List] Extract plain text from Wikipedia dump XML format

Rahma Sellami rahma.sellami at gmail.com
Wed Jun 20 19:46:05 CEST 2012


Hello,

I downloaded a Wikipedia dump in XML format, and I want to strip the Wikipedia markup to extract the plain text. I found the tool wikiprep and installed it, but I do not know which of its scripts removes the Wikipedia markup.
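To make the task concrete, here is a minimal Python sketch of the kind of processing being asked about. It is not wikiprep's method and says nothing about which wikiprep script does this. It assumes the dump follows the usual MediaWiki export layout (`<page>` elements containing a `<title>` and a `<text>` element), and the names `iter_pages` and `strip_wiki_markup` are hypothetical. The regular expressions only approximate wikitext: tables, image links with several `|` fields, and other constructs are not handled properly.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def strip_wiki_markup(text):
    """Rough wikitext cleanup (an approximation, not a full parser)."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)        # HTML comments
    text = re.sub(r"<ref[^>]*/>", "", text)                        # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove templates from the innermost outwards until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"\[https?://\S+ ([^\]]*)\]", r"\1", text)       # labelled external links
    text = re.sub(r"'{2,}", "", text)                              # bold / italic quotes
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"<[^>]+>", "", text)                            # leftover HTML tags
    return text.strip()


def iter_pages(source):
    """Yield (title, plain_text) pairs from a dump file path or binary file object."""
    title, text = "", ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, strip_wiki_markup(text)
            title, text = "", ""
            elem.clear()  # release the page's children to limit memory use
```

Calling `iter_pages` with the path of the decompressed dump file and looping over the result would give one `(title, plain_text)` pair per page. Streaming with `iterparse` is chosen here because full dumps are usually too large to load into memory at once.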

Thanks

--
RAHMA Sellami
PhD Computer Science Student
http://sites.google.com/site/rahmasellami/
Faculty of Economic Sciences and Management of Sfax
ANLP Research Group
http://sites.google.com/site/anlprg

MIRACL Laboratory www.miracl.rnu.tn

Email: rahma.sellami at gmail.com


