[Corpora-List] Extract plain text from Wikipedia dump XML format

Nasrin Baratali nasrin.baratali at gmail.com
Fri Jun 22 14:05:08 CEST 2012


Hello,

There is an earlier post on the Corpora List on a similar topic. You can find it here: http://mailman.uib.no/public/corpora/2010-September/011285.html

I am working with a Wikipedia dump and found that the following tool is also suitable: code.google.com/p/wikixmlj/
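
In case it helps to see the general idea, here is a minimal sketch in Python using only the standard library. It is not how wikixmlj or wikiprep work internally; it is just an illustration of the two steps involved: streaming the <page> elements out of the dump XML, then stripping the most common wiki markup with regular expressions. The function names (iter_pages, strip_markup) and the SAMPLE data are made up for this example, and the regex cleanup is deliberately rough (tables, nested links, and many templates-inside-links cases are not handled).

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_source):
    """Yield (title, wikitext) for each <page>, streaming to keep memory low."""
    title, text = None, ""
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the children of the finished page


def strip_markup(wikitext):
    """Rough, regex-based removal of common wiki markup (an approximation)."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)  # comments
    text = re.sub(r"<ref[^>]*/>", "", text)  # self-closing references
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove templates from the innermost outwards until none are left.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic quotes
    # == Heading == -> Heading
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text,
                  flags=re.MULTILINE)
    text = re.sub(r"<[^>]+>", "", text)  # any remaining HTML-like tags
    return text.strip()


# Tiny hand-written stand-in for a real dump, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Example</title>
    <revision>
      <text>'''Example''' is a [[test page|test]] about [[Wikipedia]].{{cite|x={{y}}}}
== History ==
It began in 2001.&lt;ref&gt;Source&lt;/ref&gt;</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title)
        print(strip_markup(wikitext))
```

For a real dump, iter_pages should also accept a file name or an open file object in place of the BytesIO, for example one returned by bz2.open(path, "rb") for a compressed dump (the path being whatever file you downloaded). For anything beyond a quick experiment, a dedicated tool will handle the markup far more completely than these regular expressions.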

Regards,

Nasrin Baratalipour, Natural Language and text Processing Laboratory (http://ece.ut.ac.ir/NLP), School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran

On Wed, Jun 20, 2012 at 10:16 PM, Rahma Sellami <rahma.sellami at gmail.com> wrote:


> Hello,
>
> I downloaded a Wikipedia dump in XML format, and I want to remove the wiki
> markup to extract the plain text.
> I found the tool wikiprep and installed it, but I do not know which
> script removes the Wikipedia markup.
>
> Thanks
> --
>
> RAHMA Sellami
> PhD Computer Science Student
> http://sites.google.com/site/rahmasellami/
> Faculty of Economic Sciences and management of Sfax
> ANLP Research Group
> http://sites.google.com/site/anlprg
>
> MIRACL Laboratory
> www.miracl.rnu.tn
>
> Email: rahma.sellami at gmail.com