[Corpora-List] Tool/program to estimate percent of English in a given text file?

Muhammad Shakir Aziz true.friend2004 at gmail.com
Sat Nov 25 15:29:15 CET 2017


Thanks a lot for your responses. My problem is a little harder because the code switching involves Roman Urdu, i.e. Urdu written in Latin script. I have observed that even Twitter sometimes fails to recognize a tweet's language, probably because the text is too short. detectlanguage.com is a good service, thanks for the pointer. It might solve the problem to a certain extent, and at least it is free. CodeSwitchFinder probably won't work because the text is not written in the Urdu script.

Regards
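For the original question (percent of English in a sample), a minimal word-list sketch in Python might look like the following. It is not how any of the tools mentioned in this thread work. The names `ENGLISH_WORDS` and `english_percent` are hypothetical, and the tiny word set is only a stand-in for a real English lexicon that would have to be supplied.

```python
import re

# Hypothetical stand-in for a real English word list (assumption:
# in practice this would be loaded from a full lexicon file).
ENGLISH_WORDS = {"the", "is", "my", "very", "good", "and", "phone", "new"}


def english_percent(text, english_words=ENGLISH_WORDS):
    """Return the percentage of Latin-script tokens found in the word list."""
    # Lowercase the text and keep runs of ASCII letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in english_words)
    return 100.0 * hits / len(tokens)


if __name__ == "__main__":
    # Mixed Roman Urdu / English sample: 3 of the 5 tokens are in the list.
    print(english_percent("yeh phone very good hai"))
```

A lookup like this will miscount any Roman Urdu token whose spelling happens to match an English word, and very short texts give unstable percentages, so the result is a rough estimate at best.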

On Sat, Nov 25, 2017 at 12:17 PM, Tim Baldwin <tb at ldwin.net> wrote:


> polyglot [1] is an open-source tool for doing exactly what you describe, as
> described in:
>
> Lui, Marco, Jey Han Lau and Timothy Baldwin (2014) Automatic Detection and
> Language Identification of Multilingual Documents, Transactions of the
> Association for Computational Linguistics, 2(Feb):27−40.
>
>
> Tim
>
> [1] https://github.com/saffsd/polyglot
>
> On Fri, 2017-11-24 at 23:08 +0100, Muhammad Shakir Aziz wrote:
> > Hi
> > I am dealing with computer mediated discourse and it has code switching as
> > well. After my current project I plan to study it. To mark sentences with code
> > switched texts, I thought Google Translate toolkit might be useful as it can
> > take input and provide detected language name and a number (0-1) telling the
> > confidence of detection result. But Google does not provide this service free
> > as far as I have explored.
> > Probably it isn't what you are looking for, just in case sharing maybe someone
> > could provide a better idea to detect language or percentage of a certain
> > language used in a given string.
> > Regards
> >
> > On Nov 24, 2017 10:56 PM, "Tristan Purvis" <tristan.purvis at aun.edu.ng> wrote:
> > >
> > > Hello,
> > >
> > > Quick version: Are there any publicly available tools or program modules I
> > > could use to estimate the percent of English that is found in a given sample
> > > of bilingual/multilingual text?
> > >
> > > In a study that includes looking at instances of code-switching (to English
> > > words) for certain lexical items whose distribution and usage I'll be
> > > tracking, I want to keep track of a given speaker's overall tendency for
> > > mixing in English. It's not a high priority as a formal variable, so if it's
> > > too time consuming to pursue, I'll be inclined to drop it, but it seems like
> > > there might be some ready-made tool in the language detection field that
> > > might incidentally serve my purposes ... Can anyone point me to a tool or
> > > quick solution that can calculate an estimate of the percent of English
> > > found in a given text sample?
> > >
> > > (Note: I only have 50-60 speakers to apply this to, so I can feasibly run
> > > each one by one into a tool that can measure this. That is, I don't
> > > necessarily need a tool that can run this in batches, though obviously that
> > > would be a nice added convenience.)
> > >
> > > Thanks in advance,
> > > Tristan
> > >
> > > ==========================
> > > Mohamed Tristan Purvis, PhD
> > > Assistant Professor, School of Arts & Sciences
> > > American University of Nigeria
> > > https://sites.google.com/site/tristanpurvis/curriculum-vitae
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > > Corpora mailing list
> > > Corpora at uib.no
> > > https://mailman.uib.no/listinfo/corpora
> > >
> --
> Tim Baldwin
> Professor, School of Computing and Information Systems
> Associate Dean (Research Training), Melbourne School of Engineering
> The University of Melbourne
> Victoria
> 3010, Australia
>
> Tel: (+61)-3-8344-1363
>
>
>

--
Muhammad Shakir Aziz محمد شاکر عزیز
PhD Candidate, Westfälische Wilhelms-Universität Münster
Linguist and Translator for English, Urdu and Punjabi
Urdu:- http://awaz-e-dost.blogspot.com/
English:- http://linguisticslearner.blogspot.com/
Facebook:- http://www.facebook.com/truefriend2004
Skype:- true_friend2004


