[Corpora-List] charset identifier

Julien Nioche lists.digitalpebble at gmail.com
Mon Apr 18 10:21:42 CEST 2011


Jorg,

Have a look at Tika (http://tika.apache.org). It does mime-type, charset and language detection, is released under the Apache License and is widely used. There is quite a bit of documentation on the Tika website, and for those who want to go deeper, the book Tika in Action is available through Manning's Early Access Program [1].
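
If you just want to experiment before setting up a full tool, the question as asked (encoding of a text, given its expected language, with a confidence value) can be approximated in a few lines of plain Python. This is only a rough sketch and not how Tika works internally: the function name guess_encoding, the candidate list and the Swedish character set are assumptions made up for illustration. It decodes the bytes strictly with each candidate and scores how many of the resulting characters are plausible for the language.

```python
def guess_encoding(data, candidates, expected_chars):
    """Return (encoding, score) pairs, best first.

    score is the share of decoded characters that are ASCII or listed in
    expected_chars (a hypothetical per-language alphabet given by the caller).
    """
    ranked = []
    for encoding in candidates:
        try:
            text = data.decode(encoding)  # strict mode: invalid bytes raise
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable data or unknown codec name: rule it out
        if not text:
            continue  # nothing to score
        plausible = sum(1 for ch in text if ord(ch) < 128 or ch in expected_chars)
        ranked.append((encoding, plausible / len(text)))
    # sorted() is stable, so ties keep the order given in candidates
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Illustrative input: a Latin-1 encoded string and a small Swedish alphabet
swedish = set("åäöÅÄÖéÉ")
sample = "Jörg Tiedemann, Uppsala".encode("iso-8859-1")
for encoding, score in guess_encoding(sample, ["utf-8", "iso-8859-1", "cp1251"], swedish):
    print(f"{encoding}: {score:.2f}")
```

The score is a crude stand-in for a confidence value. Single-byte encodings that accept almost any byte sequence will rarely be ruled out by decoding alone, so the language-specific character set does most of the work, and short or mostly-ASCII texts will produce ties.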

HTH

Julien

[1] http://www.manning.com/mattmann/

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/ http://www.digitalpebble.com

On 16 April 2011 11:51, Joerg Tiedemann <jorg.tiedemann at lingfil.uu.se> wrote:


> Can someone point me to reliable (freely available) tools for
> character set identification?
> I would like to have a rather universal tool that can give me the used
> char encoding for a given text and given the expected language of that
> text.
> (Possibly with confidence values if available.)
>
> I know about these tools (but I would also appreciate any comments
> about their quality):
>
> enca: http://gitorious.org/enca
> This is exactly what I need but does not support a lot of languages.
> Maybe someone knows how to extend it with more languages/encodings?
>
> utrac: http://utrac.sourceforge.net/
> I haven't tested it but it seems to be quite restricted as well.
>
> cpdetector: http://cpdetector.sourceforge.net/
>
> convert_encoding.py: https://github.com/goerz/convert_encoding.py
> It includes a "guess encoding" option.
>
> These tools do not seem to be freely available:
> http://www.lingua-systems.com/language-identifier/lidc-application/
> http://www.lingua-systems.com/unicode-converter/autouniconv-library/
>
> The standard unix tool 'file' is of course also sometimes helpful but
> too restricted.
>
> Is there anything else (that I can use without training specific models
> myself)?
>
> Thanks!
> Jörg
>
>
>
> --
>
> **********************************************************************************
> Jörg Tiedemann
> jorg.tiedemann at lingfil.uu.se
> Dep. of Linguistics and Philology
> http://stp.lingfil.uu.se/~joerg/
> Uppsala University tel: +46 (0)18 - 471 1412
> Box 635, SE-751 26 Uppsala/SWEDEN fax: +46 (0)18 - 471 1094