The NLP Beyond Text 2021 organizers invite you to participate in the second edition of the workshop on multi- and cross-modal NLP, which will be hosted at the Web Conference 2021 on Thursday, April 15th, from 8AM to 12PM PDT (3pm to 7pm UTC). In addition to paper presentations, the workshop will feature four keynote speeches from leading researchers in the field: Rada Mihalcea, Jason Baldridge, Desmond Elliott and Raquel Fernández. The full schedule is available at https://sites.google.com/view/nlpbt-2021/program
Best, NLPBT 2021 organizers
How to Attend: The workshop will be hosted on the MiTeam platform of the Web Conference 2021. To participate, you need to register for the Web Conference at https://www2021.thewebconf.org/attendees/
Speaker: Rada Mihalcea (University of Michigan)
Title: Challenges (and Opportunities) in Multimodal Sensing of Human Behavior
Abstract: Much of what we do today is centered around humans, whether it is creating the next generation of smartphones, understanding interactions with social media platforms, or developing new mobility strategies. A better understanding of people can not only answer fundamental questions about “us” as humans, but can also facilitate the development of enhanced, personalized technologies. In this talk, I will overview the main challenges (and opportunities) faced by research on multimodal sensing of human behavior, and illustrate these challenges with projects conducted in the Language and Information Technologies lab at Michigan.
Speaker: Jason Baldridge (Google)
Title: Language, vision and action are better together
Abstract: Human knowledge and use of language is inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language (including speech) in the context of other modalities and environments is needed, and there has never been a better time to do it. Without ever invoking the worn-out, overblown phrase “how babies learn” in the talk, I’ll cover three of my team’s efforts involving language, vision and action. First: our work on speech-image representation learning and retrieval, where we demonstrate settings in which directly encoding speech outperforms the hard-to-beat strategy of using automatic speech recognition and strong text encoders. Second: two models for text-to-image generation, a multi-stage model which exploits user guidance in the form of mouse traces and a single-stage one which uses cross-modal contrastive losses. Third: Room-across-Room, a multilingual dataset for vision-and-language navigation, for which we collected spoken navigation instructions, high-quality text transcriptions, and fine-grained alignments between words and pixels in high-definition 360-degree panoramas. I’ll wrap up with some thoughts on how work on computational language grounding more broadly presents new opportunities to enhance and advance our scientific understanding of language and its fundamental role in human intelligence.
Speaker: Desmond Elliott (University of Copenhagen)
Title: Beyond Text and Back Again
Abstract: A talk with two parts covering three modalities. In the first part, I will talk about NLP Beyond Text, where we integrate visual context into a speech recognition model and find that the recovery of different types of masked speech inputs is improved by fine-grained visual grounding against detected objects. In the second part, I will come Back Again, and talk about the benefits of textual supervision in cross-modal speech-vision retrieval models.
Speaker: Raquel Fernández (University of Amsterdam)
Title: Grounding language in visual and conversational contexts
Abstract: Most language use is driven by specific communicative goals in interactive setups, where visual perception often goes hand in hand with language processing. I will discuss some recent projects by my research group related to modelling language generation in socially and visually grounded contexts, arguing that such models can help us to better understand the cognitive processes underpinning these abilities in humans and contribute to more human-like conversational agents.
Organizing Committee:
Loïc Barrault (University of Sheffield)
Erik Cambria (Nanyang Technological University)
Giuseppe Castellucci (Amazon)
Simone Filice (Amazon)
Elman Mansimov (New York University)