[Corpora-List] Proposal for initiating a global non continentalArabic Language Academy

Linas Vepstas linasvepstas at gmail.com
Tue Nov 17 04:09:54 CET 2009


2009/11/16 Linas Vepstas <linasvepstas at gmail.com>:
> Hi,
>
> 2009/11/16 El-Haj, Mahmoud <melhaj at essex.ac.uk>:
>> Dear Hamed,
>>
>>  Arabic shares characteristics with other living languages, which could be very helpful as Arabic NLP still in it's preliminary phases comparing to others.
>
> There may be more to this than it seems. I notice that
> Regina Barzilay/MIT CSAIL: is giving seminars entitled
>  "Embracing Language Diversity: Unsupervised Multilingual
> Learning?"  wherein I think she tries to establish the idea that
> unsupervised training of parsers works even better if you
> train them on multiple languages, simultaneously. (or something
> like that, I haven't seen the lectures yet.)
>
> The upshot is that I think that it would be a mistake to pretend
> that Arabic NLP is an island, needing only short-term help until
> it catches up with other languages. More likely, one has to
> consider in context, and fully embrace "comparative linguistics",
> in both its traditional, and now,, maybe more modern senses.

Specifically, the below, which presumably will mention ties between ancient languages and modern Arabic. "I" is Regina.

Talk Title: "Embracing Language Diversity: Unsupervised Multilingual Learning?"

Talk Abstract: For centuries, the deep connection between human languages has fascinated scholars, and driven many important discoveries in linguistics and anthropology. In this talk, I will show that this connection can empower unsupervised methods for language analysis. The key insight is that joint learning from several languages reduces uncertainty about the linguistic structure of each individual language.

I will present multilingual generative unsupervised models for morphological segmentation, part-of-speech tagging, and parsing. In all of these instances we model the multilingual data as arising through a combination of language-independent and language-specific probabilistic processes. This feature allows the model to identify and learn from recurring cross-lingual patterns to improve prediction accuracy in each language. I will also discuss ongoing work on unsupervised decoding of ancient Ugaritic tablets using data from related Semitic languages.

This is joint work with Benjamin Snyder, Tahira Naseem and Jacob Eisenstein.



More information about the Corpora mailing list