[Corpora-List] 2nd CfP: Special track on the Syntactic Analysis of Non-Canonical Language

irehbein at uni-potsdam.de irehbein at uni-potsdam.de
Tue Apr 8 08:30:43 CEST 2014


Special track on the Syntactic Analysis of Non-Canonical Language
=================================================================

ENDORSED BY SIGPARSE

The SANCL special track will be part of the Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages - SPMRL-SANCL 2014

Co-located with COLING 2014, August 24 in Dublin, Ireland

Important dates (updated!)

Submission deadline: June 06, 2014
Author notification: July 01, 2014
Camera-ready deadline: July 13, 2014
Workshop: Aug 24, 2014

Main workshop: http://www.spmrl.org/spmrl-sancl2014.html

SANCL Special Track: http://www.spmrl.org/sancl-posters2014.html

SANCL Poster submissions
========================

In addition to regular paper submissions, we solicit poster submissions addressing the syntactic analysis of frequent phenomena of non-canonical languages which are difficult to annotate and parse using conventional annotation schemes. Cases in point are the representation of verbless utterances in a dependency scheme, the pros and cons of different representations of disfluencies for statistical parsing, and the analysis of complex hashtags which incorporate and merge different syntactic arguments into one token.

Poster submissions should focus on one or more of the topics listed below. They should either be submitted as a short paper (up to 7 single-column pages + references, to be included in the proceedings and presented as a poster at the workshop) or be submitted as an abstract (max. 500 words excluding examples/references, to be presented as a poster at the workshop). Abstract submissions should sketch an analysis for a given problem while short paper submissions should also present at least preliminary experimental results showing the feasibility of the approach.

Topics for poster submissions:

Unit of analysis
================

For canonical, written text, the relevant unit for syntactic analysis is defined by the sentence boundaries. In CMC (computer-mediated communication), on the other hand, sentence boundaries are not always marked in a systematic way, and for spoken language we cannot rely on sentence boundaries at all. Decisions concerning the relevant unit of analysis will influence corpus-linguistic research (e.g. measures like sentence length or syntactic complexity) as well as parsing results.

On the token level, it is also unclear what should be used as the unit of analysis. In spoken language as well as in conceptually spoken registers like CMC, multiple tokens are often merged into one new token (2,4-6), or long compound words are split into separate units (5). It is not yet clear whether it is preferable to address these issues during preprocessing, e.g. by tokenizing and normalising the text, or whether this would result in a "lossy translation", as argued by Owoputi et al. 2013, which should be avoided.

(1) @Hii_ImFruiity nuin much at all juss chillin waddup w yu ?

-- Owoputi et al. 2013: OCT27 data set

We ask for contributions on the optimal unit of analysis for non-canonical languages which do not come already separated into sentence-like units (e.g. spoken language, tweets, historical data), and for contributions on best practices for tokenizing spoken language and CMC.
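One way to picture the trade-off between normalisation and "lossy translation" is a tokenizer that records a normalised form next to the untouched surface form instead of overwriting it. The following is only a minimal sketch under stated assumptions: the `Token` class, the `tokenise` function and the tiny `NORMALISATION` lexicon are hypothetical and invented for this illustration; they are not taken from Owoputi et al. 2013 or from any existing tool.

```python
import re
from dataclasses import dataclass

# Hypothetical normalisation lexicon, invented for this illustration only.
NORMALISATION = {"juss": "just", "chillin": "chilling", "w": "with", "yu": "you"}


@dataclass
class Token:
    surface: str     # the form exactly as written in the tweet
    normalised: str  # lexicon entry if one exists, otherwise the surface form
    start: int       # character offsets into the original text
    end: int


def tokenise(text):
    """Split on whitespace and keep the surface form and its offsets."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        surface = match.group()
        normalised = NORMALISATION.get(surface.lower(), surface)
        tokens.append(Token(surface, normalised, match.start(), match.end()))
    return tokens


tweet = "@Hii_ImFruiity nuin much at all juss chillin waddup w yu ?"
tokens = tokenise(tweet)
for tok in tokens:
    print(f"{tok.surface}\t{tok.normalised}")
```

Because every token keeps its offsets, the original tweet can always be recovered, and forms that the toy lexicon does not cover (here "nuin" and "waddup") simply pass through unchanged.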

Elliptical structures and missing elements
==========================================

Non-canonical languages often include sentences where syntactic arguments are not expressed at the surface level. This raises the question of how we can provide a meaningful analysis for these structures, especially in a dependency grammar framework. One way to deal with the problem is to insert missing predicates as dummy verbs into the tree to be able to provide a dependency analysis for these structures (e.g. Seeker & Kuhn 2012; Dipper, Lüdeling & Reznicek 2013, see NoSta-D annotation guidelines). The question remains whether this approach is feasible for automatic processing, especially for the highly underspecified and ambiguous input often provided by NCLs, or whether a constituency-based analysis offers more elegant means to analyse elliptical structures.

We ask for contributions discussing the optimal representation for elliptical structures.

(2) Doesn't change the result though. -- From DCU's Football Treebank

Hashtags & friends
==================

Newly emerging text types from social media have triggered new, creative means of communication which help users to overcome the limitations of expressing themselves in a written medium. Twitter hashtags are one case in point, allowing users not only to add a semantic tag to their tweet, but also to add comments, context information, irony and sarcasm, to express personal feelings, or to evaluate. Formally, they are not bound to one particular part of speech but can include whole phrases or sentences, which implies that the common practice of tagging them with the label HASHTAG does not do them justice. This is even more so the case for hashtags encoding one or more arguments of the predicate, as in (10).

Hashtags provide a rich source of information which has already been exploited in sentiment analysis and opinion mining (e.g. Mohammad et al. 2013, Kunneman et al. 2013; also see http://www.newyorker.com/online/blogs/susanorlean/2010/06/hash.html for an overview of the different functions of hashtags). We are interested in approaches towards a syntactic analysis of hashtags (and related phenomena such as complex inflective constructions in German CMC (Schlobinski 2001)) which allow us to make better use of the information encoded in hashtags. What are the new challenges for analysing these phenomena? What can be learned from research on similar phenomena, e.g. on MWE?

(3) #itsnothebeer I don't like but the taste -- From Twitter
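To illustrate why a hashtag such as "#itsnothebeer" is hard to analyse as a single token, the sketch below enumerates every way of splitting it into words from a small vocabulary. This is a toy example under stated assumptions: the `segmentations` function and the `VOCAB` set are hypothetical and chosen by hand for this one hashtag; a realistic system would need a large lexicon and some way of ranking the candidates.

```python
def segmentations(s, vocab):
    """Return every way of splitting s into a sequence of words from vocab."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in vocab:
            # Try to segment the remainder; each success extends this split.
            for rest in segmentations(s[i:], vocab):
                results.append([head] + rest)
    return results


# Hypothetical toy vocabulary, chosen by hand for this illustration.
VOCAB = {"its", "no", "not", "the", "he", "beer"}

hashtag = "#itsnothebeer"
candidates = segmentations(hashtag.lstrip("#"), VOCAB)
for words in candidates:
    print(" ".join(words))
# prints:
# its no the beer
# its not he beer
```

Even with six vocabulary entries the hashtag receives two competing segmentations, so any syntactic analysis has to resolve token-internal ambiguity before it can attach the hashtag material to the rest of the tweet.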

Disfluencies
============

Disfluencies (e.g. fillers, repairs) are a common phenomenon in spoken language and also occur in written, but conceptually spoken language such as CMC.

(4) He uh graduated from medical school this year and uh, I mean he's in uh, ... Soho in New York.

-- SBC046, Du Bois et al. 2000: Santa Barbara corpus of spoken American English

There are different ways of representing disfluencies. In the Switchboard corpus, fillers are included in the tree, and for repairs, both the repair and the reparandum are attached to the same node. In the German Verbmobil treebank, fillers have been removed and so-called speech errors and repetitions are not integrated in the tree but instead are attached to the root node. The different representations are expected to have an impact on statistical parsing as well as on the usefulness of the resources for linguistic research.

We ask for contributions discussing the best way of representing disfluencies in the syntax tree.
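The choice between keeping fillers in the analysis and removing them beforehand can be sketched on the token level, using the utterance from example (4). This is only an illustration under stated assumptions: the `FILLERS` list and both helper functions are hypothetical, and the output is not the annotation format of Switchboard or of the Verbmobil treebank.

```python
FILLERS = {"uh", "um"}  # toy filler list, an assumption for this illustration


def mark_fillers(tokens):
    """Keep every token and pair it with a flag saying whether it is a filler."""
    return [(tok, tok.strip(",.").lower() in FILLERS) for tok in tokens]


def remove_fillers(tokens):
    """Drop fillers, but remember the original position of each kept token."""
    kept = [(i, tok) for i, (tok, is_filler) in enumerate(mark_fillers(tokens))
            if not is_filler]
    return [tok for _, tok in kept], [i for i, _ in kept]


utterance = ("He uh graduated from medical school this year and uh, "
             "I mean he's in uh, ... Soho in New York.")
tokens = utterance.split()

marked = mark_fillers(tokens)                 # fillers stay in the sequence
cleaned, positions = remove_fillers(tokens)   # fillers removed before parsing

print(" ".join(cleaned))
print(positions)
```

Keeping the position list means that a parse of the cleaned sequence can still be mapped back onto the original transcript, which matters if the resource is also meant to support linguistic research on the disfluencies themselves.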

Code mixing
===========

In informal spoken language as well as in CMC, a considerable amount of the data includes code mixing. This poses a major challenge for automatic processing, all the more so because there is no agreed-upon theoretical distinction between loanwords and foreign words. Should we annotate foreign language material using the same annotation scheme as for the target language, especially in cases where the grammatical differences between the languages involved do not easily allow us to do so?

(5) es tut mir so leid vallah ich wollte kommen ama unuttum

it does me so harm my God I wanted come but forget-pst-1-sg

"I am so sorry, really, I wanted to come but I forgot"

-- From Twitter

We ask for contributions discussing best practices for the syntactic analysis of code mixing.

For more examples and information, please visit: http://www.spmrl.org/sancl-posters2014.html

SANCL Special Track Organizers

Ozlem Cetinoglu (IMS, Germany)
Ines Rehbein (Potsdam University, Germany)
Djamé Seddah (Université Paris Sorbonne & Inria's Alpage project)
Joel Tetreault (Yahoo! Labs, US)


