[Corpora-List] Tool/program to estimate percent of English in a given text file?

Tim Baldwin tb at ldwin.net
Sat Nov 25 12:17:59 CET 2017


polyglot [1] is an open-source tool that does exactly what you describe; the method is set out in:

Lui, Marco, Jey Han Lau and Timothy Baldwin (2014) Automatic Detection and Language Identification of Multilingual Documents, Transactions of the Association for Computational Linguistics, 2(Feb):27−40.

Tim

[1] https://github.com/saffsd/polyglot
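
For a rough first pass at the figure asked about below (the percent of English in a text sample), a plain lexicon lookup can be sketched in a few lines of Python. This is a crude dictionary-lookup baseline, not polyglot's method, and the word list, function name, and sample sentence are all hypothetical placeholders; a real run would load a full English word list from a file.

```python
import re

# Hypothetical, deliberately tiny word list for illustration only;
# in practice this would be a full English lexicon loaded from a file.
ENGLISH_WORDS = {
    "the", "a", "is", "to", "and", "i", "you", "will", "go",
    "market", "tomorrow",
}

# A token is a run of letters, optionally followed by an apostrophe
# and more letters (e.g. "don't").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def percent_english(text, lexicon=ENGLISH_WORDS):
    """Return the percentage of word tokens found in `lexicon`.

    Returns 0.0 when the text contains no word tokens.
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return 100.0 * hits / len(tokens)


# Illustrative mixed-language sample: 5 of its 7 tokens are in the toy lexicon.
sample = "I will go to the kasuwa gobe"
print(round(percent_english(sample), 1))
```

A lookup like this overcounts words that are spelled the same in both languages and misses English words absent from the list, so it is best treated as a sanity check alongside a proper language-identification tool.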

On Fri, 2017-11-24 at 23:08 +0100, Muhammad Shakir Aziz wrote:
> Hi
> I am working with computer-mediated discourse, which includes code switching
> as well, and I plan to study it after my current project. To mark sentences
> containing code-switched text, I thought the Google Translate toolkit might be
> useful, as it can take input and return the detected language name and a
> number (0-1) giving the confidence of the detection result. But as far as I
> have explored, Google does not provide this service for free.
> This probably isn't what you are looking for, but I am sharing it in case
> someone can suggest a better way to detect the language, or the percentage of
> a certain language, used in a given string.
> Regards
>
> On Nov 24, 2017 10:56 PM, "Tristan Purvis" <tristan.purvis at aun.edu.ng> wrote:
> >
> > Hello,
> >
> > Quick version: Are there any publicly available tools or program modules I
> > could use to estimate the percent of English that is found in a given sample
> > of bilingual/multilingual text?
> >
> > In a study that includes looking at instances of code-switching (to English
> > words) for certain lexical items whose distribution and usage I'll be
> > tracking, I want to keep track of a given speaker's overall tendency for
> > mixing in English. It's not a high priority as a formal variable, so if it's
> > too time-consuming to pursue, I'll be inclined to drop it, but it seems like
> > there might be some ready-made tool in the language detection field that
> > might incidentally serve my purposes ... Can anyone point me to a tool or
> > quick solution that can calculate an estimate of the percent of English
> > found in a given text sample?
> >
> > (Note: I only have 50-60 speakers to apply this to, so I can feasibly run
> > each one by one through a tool that can measure this. That is, I don't
> > necessarily need a tool that can run this in batches, though obviously that
> > would be a nice added convenience.)
> >
> > Thanks in advance,
> > Tristan
> >
> > ==========================
> > Mohamed Tristan Purvis, PhD
> > Assistant Professor, School of Arts & Sciences
> > American University of Nigeria
> > https://sites.google.com/site/tristanpurvis/curriculum-vitae
> >
> >
> >
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > https://mailman.uib.no/listinfo/corpora
> >
--
Tim Baldwin
Professor, School of Computing and Information Systems
Associate Dean (Research Training), Melbourne School of Engineering
The University of Melbourne
Victoria 3010, Australia

Tel: (+61)-3-8344-1363


