[Corpora-List] Urdu-Hindi Transliteration corpora

Hassan Sajjad hassaan84s at gmail.com
Tue Mar 20 14:42:22 CET 2018


Hi Nick,

Here is a link to almost 2000 Hindi-Urdu transliteration pairs used in the paper mentioned by Alex.

http://alt.qcri.org/~hsajjad/resources/hindiUrduGoldStandard-15-12-09

Every line in the list consists of a Hindi word, an Urdu word, and a label, where the label can be "ti" (transliteration), "ta" (translation), "ma" (misalignment), etc. Since you are interested in transliteration only, simply grep "ti" and you will get around 2000 word pairs that are transliterations of each other.
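
For anyone who would rather filter the list in a script than with grep, here is a minimal sketch. It assumes each line holds exactly three whitespace-separated fields in the order Hindi word, Urdu word, label, and that the file is UTF-8 encoded; neither assumption has been checked against the file itself, and the function name and the local file name in the comment are made up for illustration.

```python
from typing import Iterable, List, Tuple


def transliteration_pairs(lines: Iterable[str], label: str = "ti") -> List[Tuple[str, str]]:
    """Return (hindi, urdu) pairs whose label field equals `label`."""
    pairs = []
    for line in lines:
        fields = line.split()
        # Assumed layout: Hindi word, Urdu word, label (whitespace-separated).
        # Lines that do not have exactly three fields are skipped.
        if len(fields) != 3:
            continue
        hindi, urdu, tag = fields
        if tag == label:
            pairs.append((hindi, urdu))
    return pairs


# Hypothetical usage, assuming the list was saved locally as UTF-8:
# with open("hindiUrduGoldStandard-15-12-09", encoding="utf-8") as f:
#     pairs = transliteration_pairs(f)
```

Unlike a plain grep, this compares the label field exactly, so a line is kept only when its third field is "ti".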

If you want to go with transliteration mining, you can use this set for evaluation purposes.
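
As a rough illustration of using the list for evaluation, the sketch below scores a set of mined pairs against the "ti" pairs with set-based precision, recall and F-measure. This is a common scheme and an assumption on my side, not necessarily the exact measure used in the paper; the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def precision_recall_f1(mined: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Score mined (hindi, urdu) pairs against gold transliteration pairs."""
    mined_set: Set[Pair] = set(mined)
    gold_set: Set[Pair] = set(gold)
    # A mined pair counts as correct only if it appears in the gold set.
    correct = len(mined_set & gold_set)
    precision = correct / len(mined_set) if mined_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Pairs are compared as exact string matches here, so any normalization of the two scripts would have to happen before calling the function.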

Let me know if something is not clear.

Cheers, Hassan

On Tue, Mar 20, 2018 at 4:00 PM, <corpora-request at uib.no> wrote:


> Today's Topics:
>
> 1. Postdoc: NLP for literary texts, UC Berkeley (David Bamman)
> 2. Re: Urdu-Hindi Transliteration corpora (Nick Ruiz)
> 3. Re: Urdu-Hindi Transliteration corpora (Alexander Fraser)
> 4. CfP: LACompLing2018 - Logic and Algorithms in Computational
> Linguistics 2018, Stockholm (Roussanka Loukanova)
> 5. CFP - Special Session EnGeoData - Environmental and
> Geo-spatial Data Analytics - DSAA'2018 (Mathieu Roche)
> 6. CFP: COLING 2018 First Workshop on Natural Language
> Processing for Internet Freedom (NLP4IF) (Anna Feldman)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 19 Mar 2018 10:23:47 -0700
> From: David Bamman <david.bamman at gmail.com>
> Subject: [Corpora-List] Postdoc: NLP for literary texts, UC Berkeley
> To: corpora at uib.no
>
> Postdoc: NLP for computational literary analysis
>
> About the project:
>
> Literary novels push the limits of natural language processing. While much
> work in NLP has been heavily optimized toward the narrow domain of
> contemporary English newswire (and now increasingly social media), literary
> novels are an entirely different animal: the long, complex sentences in
> novels strain the limits of syntactic parsers with super-linear
> computational complexity, their use of figurative language challenges
> representations of meaning based on neo-Davidsonian semantics, and their
> long length (ca. 100,000 words on average) rules out existing solutions for
> problems like coreference resolution that expect a small set of candidate
> antecedents.
>
> At the same time, fiction drives computational research questions that are
> uniquely interesting to that domain, and this interest often spills out
> elsewhere in unexpected ways. The task of authorship attribution was first
> proposed by Mendenhall (1887) to discriminate the works of Francis Bacon,
> Shakespeare and Christopher Marlowe, which later drove the pioneering work
> on the Federalist Papers by Mosteller and Wallace (1964) and now is used in
> applications as far removed as forensic analysis. Current active areas of
> literary NLP research include extracting social networks from novels (Elson
> et al., 2010), learning representations of character relationships (Iyyer et
> al., 2016), quote attribution (Muzny et al., 2017), and learning to infer
> readers' attitudes to the stories they read (Milli and Bamman, 2016).
>
> In this project, we will focus on developing computational models of a
> uniquely literary problem: plot. We will set out to develop and improve the
> fundamental applications in natural language processing that help make a
> realistic computational model of plot possible; while "plot" itself is a
> complex abstraction, one contribution of our work here is to decompose it
> into solvable sub-problems, each of which can be researched and evaluated
> on its own terms. We will focus in this work on the atomic elements: at the
> very least, plot involves people (characters), places (the setting where
> action takes place), time (when those actions take place), and things
> (objects that are important), all interacting through depicted events (in
> the form of actions, not descriptions). Each of these atomic elements
> entails individual sub-problems in NLP; some of these exist as formal
> problems (named entity recognition, character clustering, temporal
> information processing), while others do not yet.
>
> In this role, you will carry out primary research in this general area, work
> with graduate students and supervise collaborative teams of undergraduates
> in computer science, data science, and English literature. This is a
> one-year position, beginning anytime before September 1, 2018.
>
> About you:
>
> Qualified candidates should have a track record of publishing in NLP
> conferences (e.g., ACL, EMNLP), journals (TACL), and/or workshops
> associated with them (e.g., CLfL, LaTeCH). The area of the PhD can be a
> technical field (computer science, information science, statistics) or an
> area of the humanities with a strong computational research focus.
>
> To apply:
>
> To apply, send a CV, cover letter, links to two writing samples, and the
> names and contact information for three references familiar with your work
> to David Bamman (dbamman at berkeley.edu <mailto:dbamman at berkeley.edu>).
> Applications will be reviewed on a rolling basis.
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 19 Mar 2018 14:26:35 -0400
> From: Nick Ruiz <nruiz at interactions.com>
> Subject: Re: [Corpora-List] Urdu-Hindi Transliteration corpora
> To: Alexander Fraser <fraser at cis.uni-muenchen.de>
> Cc: corpora at uib.no
>
> Thanks, Alex! Nadir reached out to me to discuss transliteration mining. I
> might go that route or try to crowdsource a small corpus, if no other
> resources pop up. Thanks to everyone else who has graciously replied to me
> with ideas as well.
>
> Best,
> Nick
>
> On Mon, Mar 19, 2018 at 2:24 PM, Alexander Fraser <fraser at cis.uni-muenchen.de> wrote:
>
> > Hi Nick,
> >
> > Maybe relevant, maybe not:
> >
> > Nadir Durrani, Hassan Sajjad, Alexander Fraser, Helmut Schmid (2010).
> > Hindi-to-Urdu Machine Translation Through Transliteration
> > <http://www.cis.uni-muenchen.de/~fraser/pubs/durrani_acl2010.pdf>. In
> > Proceedings of the 48th Annual Meeting of the Association for
> > Computational Linguistics (ACL), pages 465-474, Uppsala, Sweden, July.
> >
> > Cheers, Alex
> >
> >
> > On Sat, Mar 17, 2018 at 7:59 AM, Nick Ruiz <nruiz at interactions.com> wrote:
> >
> >> Hi all,
> >>
> >> Can you help me identify any Urdu-Hindi parallel transliteration corpora
> >> that are available on the web? By transliteration, I mean strictly the
> >> conversion of writing systems, not translation. Thanks in advance!
> >>
> >> Kind regards,
> >>
> >> Nicholas Ruiz
> >> Interactions Labs
> >>
> >> ************************************************************
> >> *******************
> >>
> >> This e-mail and any of its attachments may contain Interactions LLC
> >> proprietary information, which is privileged, confidential, or subject to
> >> copyright belonging to the Interactions LLC. This e-mail is intended solely
> >> for the use of the individual or entity to which it is addressed. If you
> >> are not the intended recipient of this e-mail, you are hereby notified that
> >> any dissemination, distribution, copying, or action taken in relation to
> >> the contents of and attachments to this e-mail is strictly prohibited and
> >> may be unlawful. If you have received this e-mail in error, please notify
> >> the sender immediately and permanently delete the original and any copy of
> >> this e-mail and any printout. Thank You.
> >>
> >> ************************************************************
> >> *******************
> >>
> >> _______________________________________________
> >> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> >> Corpora mailing list
> >> Corpora at uib.no
> >> https://mailman.uib.no/listinfo/corpora
> >>
> >>
> >
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 19 Mar 2018 14:24:06 -0400
> From: Alexander Fraser <fraser at cis.uni-muenchen.de>
> Subject: Re: [Corpora-List] Urdu-Hindi Transliteration corpora
> To: Nick Ruiz <nruiz at interactions.com>
> Cc: corpora at uib.no
>
> Hi Nick,
>
> Maybe relevant, maybe not:
>
> Nadir Durrani, Hassan Sajjad, Alexander Fraser, Helmut Schmid (2010).
> Hindi-to-Urdu Machine Translation Through Transliteration
> <http://www.cis.uni-muenchen.de/~fraser/pubs/durrani_acl2010.pdf>. In
> Proceedings of the 48th Annual Meeting of the Association for Computational
> Linguistics (ACL), pages 465-474, Uppsala, Sweden, July.
>
> Cheers, Alex
>
>
> On Sat, Mar 17, 2018 at 7:59 AM, Nick Ruiz <nruiz at interactions.com> wrote:
>
> > Hi all,
> >
> > Can you help me identify any Urdu-Hindi parallel transliteration corpora
> > that are available on the web? By transliteration, I mean strictly the
> > conversion of writing systems, not translation. Thanks in advance!
> >
> > Kind regards,
> >
> > Nicholas Ruiz
> > Interactions Labs
> >
>
> ------------------------------
>
> Message: 4
> Date: Tue, 20 Mar 2018 10:57:43 +0900
> From: Roussanka Loukanova <rl.stpuu at gmail.com>
> Subject: [Corpora-List] CfP: LACompLing2018 - Logic and Algorithms in
> Computational Linguistics 2018, Stockholm
> To: Roussanka Loukanova <rl.stpuu at gmail.com>
>
> CALL FOR PAPERS
>
> Symposium Logic and Algorithms in Computational Linguistics 2018
> (LACompLing2018)
> Stockholm, 28-31 August 2018
> Department of Mathematics, Stockholm University
>
> http://staff.math.su.se/rloukanova/LACompLing2018-web/
> ================================================
>
> DESCRIPTION
> ==
> Computational linguistics studies natural language in its various
> manifestations from a computational point of view, both on the theoretical
> level (modeling grammar modules dealing with natural language form and
> meaning, and the relation between these two) and on the practical level
> (developing applications for language and speech technology). Right from
> the start in the 1950s, there have been strong links with computer
> science, logic, and many areas of mathematics - one can think of Chomsky's
> contributions to the theory of formal languages and automata, or Lambek's
> logical modeling of natural language syntax. The workshop assesses the
> place of logic, mathematics, and computer science in present day
> computational linguistics. It intends to be a forum for presenting new
> results as well as work in progress.
> --------------------------------
>
> SCOPE
> ==
> The workshop focuses mainly on logical approaches to computational
> processing of natural language, and on the applicability of methods and
> techniques from the study of artificial languages (programming/logic) in
> computational linguistics. We invite participation and submissions from
> other relevant approaches too, especially if they can inspire new work and
> approaches.
>
> The topics of LACompLing2018 include, but are not limited to:
>
> - Computational theories of human language
> - Computational syntax
> - Computational semantics
> - Computational syntax-semantics interface
> - Interfaces between morphology, lexicon, syntax, semantics, speech, text,
> pragmatics
> - Computational grammar
> - Logic and reasoning systems for linguistics
> - Type theories for linguistics
> - Models of computation and algorithms for linguistics
> - Language processing
> - Parsing algorithms
> - Generation of language from semantic representations
> - Large-scale grammars of natural languages
> - Multilingual processing
> - Data science in language processing
> - Machine learning of language
> - Interdisciplinary methods
> - Integration of formal, computational, model theoretic, graphical,
> diagrammatic, statistical, and other related methods
> - Logic for information extraction or expression in written and spoken
> language
> - Language theories based on biological fundamentals of information and
> languages
> - Computational neuroscience of language
>
> IMPORTANT DATES
> ==
> Submission deadline, regular papers: 15 May 2018 (Anywhere on Earth / AoE)
> Submission deadline, abstracts: 31 May 2018 (AoE)
> Notifications: 15 June 2018
> Final submissions: TBA
> LACompLing2018: between 28-31 Aug 2018 (a few days, depending on the program)
>
> SUBMISSION INSTRUCTIONS
> ==
> We invite original, regular papers that are not submitted concurrently to
> another conference or for publication elsewhere. Abstracts of presentations
> can be on work submitted or published elsewhere.
>
> - Regular papers: maximum 10 pages, including figures and references
> - Abstracts of contributed presentations: not more than 2 pages
> - The submissions of proposed papers and abstracts have to be in pdf
> - The camera-ready submissions require the pdf and their sources
>
> Authors are required to use Springer LNCS style files. Styles and templates
> can be downloaded from Springer, for LaTeX and Microsoft:
>
> http://www.springer.com/jp/computer-science/lncs/conference-proceedings-guidelines
>
> The submissions are via the EasyChair management system of LACompLing2018:
>
> https://easychair.org/conferences/?conf=lacompling2018
>
> PUBLICATIONS
> ==
> - The proceedings of LACompLing2018 will be published digitally by the DiVA
> system of Stockholm University:
> http://su.diva-portal.org
>
> - Improved and extended versions of selected papers, which have been
> presented at the workshop LACompLing2018, will be published in a special
> issue of a journal after the workshop.
>
> ORGANIZERS
> ==
> Krasimir Angelov, University of Gothenburg, Sweden
> Kristina Liefke, Ludwig-Maximilians-University Munich, Germany
> Roussanka Loukanova, Stockholm University, Sweden (chair)
> Michael Moortgat, Utrecht University, The Netherlands
> Satoshi Tojo, School of Information Science, JAIST, Japan
>
> CONTACT
> ==
> Roussanka Loukanova (rloukanova at gmail.com)
> Kristina Liefke (Liefke at lingua.uni-frankfurt.de)
> --------------------------------
>
> ------------------------------
>
> Message: 5
> Date: Tue, 20 Mar 2018 03:13:04 +0100
> From: Mathieu Roche <mathieu.roche at cirad.fr>
> Subject: [Corpora-List] CFP - Special Session EnGeoData -
> Environmental and Geo-spatial Data Analytics - DSAA'2018
> To: mtd-et-tetis at teledetection.fr
>
> =====================================================================
>
> IEEE DSAA'2018
>
> Special Session
> EnGeoData - Environmental and Geo-spatial Data Analytics
>
> Turin, Italy
> October 1-4, 2018
>
> http://dsaa2018.isi.it
>
> =====================================================================
>
> ---- AIMS AND TOPICS
>
> Environmental and more generally geo-spatial information is now provided
> by crowdsourcing but also by public administrations in the context of the
> open data policies. Analyses of such data are still challenging, firstly
> because of their heterogeneity (structural, semantic, spatial and
> temporal), and secondly because of the difficulty in choosing the "best"
> knowledge discovery process to apply, according to the needs of the experts
> in the field. This special issue aims to provide high quality research
> covering all or part of the challenges mentioned above, from a theoretical
> or experimental point of view.
>
> Data science challenges concern the creation, storage, search,
> sharing, modeling, analysis, and visualization of data, information, and
> knowledge. In a data science context, spatio-temporal aspects are crucial in
> order to manage and mine data, to index and retrieve information, and
> finally to discover and visualize knowledge. By taking into account these
> spatio-temporal aspects, original methods have to be proposed for
> processing real and complex data from different domains, e.g., environment,
> agriculture, health, urban, and so forth.
>
> Topics:
> - Pre and post processing of environmental and agriculture data
> - Geographical information retrieval
> - Spatial data mining and spatial data warehousing
> - Knowledge discovery use-cases dedicated to environmental data
> - Spatial text mining
> - Spatial ontology
> - Spatial recommendations and personalization
> - Visual analytics for geospatial data
> - Dedicated applications:
> * Spatio-temporal analytics platform
> * Agricultural Decision Support Systems
> * Urban traffic systems
> * Trajectory analysis
> * Land-use and urban policies
> * Land-use and urban planning analysis
> * Spatio-temporal analysis in Ecology and Agriculture
> * and so forth
>
> ---- SUBMISSION WEBSITE:
>
> The submissions Web site for DSAA 2018 Special Sessions is Easy Chair (
> https://easychair.org/conferences/?conf=dsaa2018) that is the same as the
> submission Web site for the main conference track.
>
> ---- IMPORTANT DATES:
>
> Special Session Paper Submission: May 25, 2018
> Notification of acceptance: July 20, 2018
> Camera-Ready: Aug. 3, 2018
> Early Registration: Aug. 13, 2018
>
> ---- PUBLICATIONS:
>
> Special session papers follow the same format as the conference papers.
> The paper length allowed is a maximum of ten (10) pages, in 2-column U.S.
> letter style using IEEE Conference template (see the IEEE Proceedings
> Author Guidelines:
> http://www.ieee.org/conferences_events/conferences/publishing/templates.html).
>
> All submissions will be blind reviewed by the Program Committee on the
> basis of technical quality, relevance to conference topics of interest,
> originality, significance, and clarity. Author names and affiliations must
> not appear in the submissions, and bibliographic references must be
> adjusted to preserve author anonymity. Submissions failing to comply with
> paper formatting and author anonymity will be rejected without review.
>
> All accepted papers, including main tracks and special sessions, will be
> published by IEEE and will be submitted for inclusion in the IEEE Xplore
> Digital Library.
>
> Top quality papers accepted and presented at the conference will be
> selected for extension and invited to a special issue of International
> Journal of Data Science and Analytics (JDSA, Springer). A special issue
> associated with EnGeoData sessions was published in 2018:
> https://link.springer.com/journal/41060/5/2/page/1
>
>
> --- ORGANIZERS:
>
> Diana Inkpen, University of Ottawa, Canada
> Mathieu Roche, Cirad, TETIS, France
> Maguelonne Teisseire, Irstea, TETIS, France
>
>
>
>
>
>
> ------------------------------
>
> Message: 6
> Date: Mon, 19 Mar 2018 22:49:54 -0400
> From: Anna Feldman <feldmana at mail.montclair.edu>
> Subject: [Corpora-List] CFP: COLING 2018 First Workshop on Natural
> Language Processing for Internet Freedom (NLP4IF)
> To: corpora at uib.no
>
> ********************* Call for Papers *********************
>
>
> COLING 2018 First Workshop on Natural Language Processing for Internet
> Freedom (NLP4IF)
>
> August 20 or August 21, 2018, Santa Fe, New Mexico.
>
> Invited speakers: TBA
>
> (Support from the US National Science Foundation allows us to offer domestic
> travel grants to student participants.)
>
>
> https://cbrew.github.io/nlp4if/
>
>
> According to the recent report produced by Freedom House (freedomhouse.org),
> an "independent watchdog organization dedicated to the expansion of freedom
> and democracy around the world", Internet freedom declined in 2016 for the
> sixth consecutive year. 67% of all Internet users live in countries where
> criticism of the government, military, or ruling family is subject to
> censorship. Social media users face unprecedented penalties, as authorities
> in 38 countries made arrests based on social media posts over the past
> year. Globally, 27% of all internet users live in countries where people
> have been arrested for publishing, sharing, or merely "liking" content on
> Facebook. Governments are increasingly going after messaging apps like
> WhatsApp and Telegram, which can spread information quickly and securely.
>
>
> Various barriers exist to prevent citizens of a large number of countries
> from accessing information. Some involve infrastructural and economic barriers,
> others violations of user rights such as surveillance, privacy and
> repercussions for online speech and activities such as imprisonment,
> extralegal harassment or cyberattacks. Yet another area is limits on
> content, which involves legal regulations on content, technical filtering
> and blocking websites, (self-)censorship.
>
>
> Large internet providers are effective monopolies, and themselves have the
> power to use NLP techniques to control information flow. Users are
> suspended or banned, sometimes without human intervention, and with little
> opportunity for redress. Users react to this by using coded, oblique or
> metaphorical language, and by taking steps to conceal their identity such as
> the use of multiple accounts, which raises questions about who the real
> originating author of a post actually is.
>
>
> This workshop should bring together NLP researchers whose work contributes
> to the free flow of information on the Internet. The topics of interest
> include (but are not limited to) the following:
>
>
>
> - Censorship detection: detecting deleted or edited text; detecting
> blocked keywords/banned terms;
> - Censorship circumvention techniques: linguistically inspired
> countermeasures for Internet censorship such as keyword substitution,
> expanding coverage of existing banned terms, text paraphrasing,
> linguistic steganography, generating information morphs, etc.;
> - Detection of self-censorship;
> - Identifying potentially censorable content;
> - Disinformation/Misinformation detection: fake news, fake accounts,
> rumor detection, etc.;
> - Techniques to empirically measure Internet censorship across
> communication platforms;
> - Investigations on covert linguistic communication and its limits;
> - Identity and private information detection;
> - Passive and targeted surveillance techniques;
> - Ethics in NLP;
> - "Walled gardens", personalization and fragmentation of the online
> public space;
>
>
> We hope that our workshop will promote Internet freedom in countries where
> accessing and sharing of information are strictly controlled by censorship.
>
> ----------------------------------------------------------------------
> Send Corpora mailing list submissions to
> corpora at uib.no
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://mailman.uib.no/listinfo/corpora
> or, via email, send a message with subject or body 'help' to
> corpora-request at uib.no
>
> You can reach the person managing the list at
> corpora-owner at uib.no
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Corpora digest..."
>
>
>
>
> End of Corpora Digest, Vol 129, Issue 30
> ****************************************
>

--
Regards,
Hassan Sajjad


