[Corpora-List] Tool/program to estimate percent of English in a given text file?

Tristan Purvis tristan.purvis at aun.edu.ng
Sat Nov 25 23:13:13 CET 2017


Thank you everyone for the input. The resource Tim Baldwin shared (polyglot) sounds very promising.
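
For anyone who only needs a rough per-speaker figure, a crude word-list baseline can be sketched as below. This is not how polyglot works; it is a minimal sketch that counts the share of tokens found in an English word list. The function name `percent_english` and the tiny `toy_lexicon` are hypothetical and used only for illustration; a real run would need a full English lexicon, and words shared between English and the other language would skew the estimate.

```python
import re

def percent_english(text, english_words):
    """Return the percentage of word tokens that appear in english_words.

    english_words is assumed to be a set of lowercase English word forms
    (a hypothetical lexicon supplied by the user).
    """
    # Lowercase the text and pull out runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in english_words)
    return 100.0 * hits / len(tokens)

# Toy example: 4 of the 6 tokens are in the toy lexicon.
toy_lexicon = {"i", "went", "to", "the", "market"}
sample = "I went to the bazaar kal"
print(round(percent_english(sample, toy_lexicon), 1))  # 66.7
```

With 50-60 speakers, the same function could simply be called once per transcript file.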

==========================
Mohamed Tristan Purvis, PhD
Assistant Professor, School of Arts & Sciences
American University of Nigeria
tristan.purvis at aun.edu.ng
Glo/AUN: 08057055115; Mtn: 07065631040; VoIP: 1400
Office: A&S 233

On Sat, Nov 25, 2017 at 3:29 PM, Muhammad Shakir Aziz <true.friend2004 at gmail.com> wrote:


> Thanks a lot for your responses.
> My problem is a little more difficult because code switching is in Roman
> Urdu. I have observed that even Twitter is not able to recognize tweets'
> language sometimes, probably because the text is too short.
> detectlanguage.com is a good service, thanks for the information. It
> might help solve the problem to a certain extent. And at least it is free.
> CodeSwitchFinder probably won't work because the script is not Urdu.
> Regards
>
> On Sat, Nov 25, 2017 at 12:17 PM, Tim Baldwin <tb at ldwin.net> wrote:
>
>> polyglot [1] is an open-source tool for doing exactly what you describe,
>> as described in:
>>
>> Lui, Marco, Jey Han Lau and Timothy Baldwin (2014) Automatic Detection and
>> Language Identification of Multilingual Documents, Transactions of the
>> Association for Computational Linguistics, 2(Feb):27−40.
>>
>>
>> Tim
>>
>> [1] https://github.com/saffsd/polyglot
>>
>> On Fri, 2017-11-24 at 23:08 +0100, Muhammad Shakir Aziz wrote:
>> > Hi
>> > I am dealing with computer-mediated discourse, and it has code switching as
>> > well. After my current project I plan to study it. To mark sentences with
>> > code-switched text, I thought Google Translate toolkit might be useful, as
>> > it can take input and provide the detected language name and a number (0-1)
>> > giving the confidence of the detection result. But Google does not provide
>> > this service for free, as far as I have explored.
>> > This probably isn't what you are looking for; I am sharing it in case
>> > someone can suggest a better way to detect the language, or the percentage
>> > of a certain language, used in a given string.
>> > Regards
>> >
>> > On Nov 24, 2017 10:56 PM, "Tristan Purvis" <tristan.purvis at aun.edu.ng> wrote:
>> > >
>> > > Hello,
>> > >
>> > > Quick version: Are there any publicly available tools or program modules
>> > > I could use to estimate the percent of English that is found in a given
>> > > sample of bilingual/multilingual text?
>> > >
>> > > In a study that includes looking at instances of code-switching (to
>> > > English words) for certain lexical items whose distribution and usage
>> > > I'll be tracking, I want to keep track of a given speaker's overall
>> > > tendency for mixing in English. It's not a high priority as a formal
>> > > variable, so if it's too time-consuming to pursue, I'll be inclined to
>> > > drop it, but it seems like there might be some ready-made tool in the
>> > > language detection field that might incidentally serve my purposes ...
>> > > Can anyone point me to a tool or quick solution that can calculate an
>> > > estimate of the percent of English found in a given text sample?
>> > >
>> > > (Note: I only have 50-60 speakers to apply this to, so I can feasibly
>> > > run each one by one through a tool that can measure this. That is, I
>> > > don't necessarily need a tool that can run this in batches, though
>> > > obviously that would be a nice added convenience.)
>> > >
>> > > Thanks in advance,
>> > > Tristan
>> > >
>> > > ==========================
>> > > Mohamed Tristan Purvis, PhD
>> > > Assistant Professor, School of Arts & Sciences
>> > > American University of Nigeria
>> > > https://sites.google.com/site/tristanpurvis/curriculum-vitae
>> > >
>> > >
>> > >
>> > >
>> > > _______________________________________________
>> > > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> > > Corpora mailing list
>> > > Corpora at uib.no
>> > > https://mailman.uib.no/listinfo/corpora
>> > >
>> --
>> Tim Baldwin
>> Professor, School of Computing and Information Systems
>> Associate Dean (Research Training), Melbourne School of Engineering
>> The University of Melbourne
>> Victoria
>> 3010, Australia
>>
>> Tel: (+61)-3-8344-1363
>>
>>
>>
>
>
> --
> Muhammad Shakir Aziz محمد شاکر عزیز
> Phd Candidate, Westfälische Wilhelms-Universität Münster
> Linguist and Translator for English, Urdu and Punjabi
> Urdu:- http://awaz-e-dost.blogspot.com/
> English:- http://linguisticslearner.blogspot.com/
> Facebook:- http://www.facebook.com/truefriend2004
> Skype:- true_friend2004
>