Tutorial 4: Multimodal Language Processing

Michael Johnston and Srinivas Bangalore

The ongoing convergence of the web with telephony, through technologies such as Voice over IP, high-speed mobile data networks, and handheld computers and smartphones, enables the creation of natural and highly effective multimodal interfaces for human-human communication and human-machine interaction with automated services. These interfaces allow user input and system output to be optimally distributed over multiple modes such as speech, pen, and graphical displays. Research on the computational processing of language has primarily focused on linear sequences of speech or text whose primitive elements are phonemes, morphemes, or words. Multimodal language, in contrast, can be distributed over two or three spatial dimensions as well as the temporal dimension and can involve additional primitive elements such as gestures, drawings, tables, and charts. This tutorial provides an overview of the problem of multimodal language processing, with detailed examples showing how representations and techniques from natural language and dialog processing can be extended and applied to the parsing, integration, and understanding of multimodal inputs and to the planning, generation, and presentation of multimodal outputs.
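
As a concrete illustration of what integration of multimodal inputs involves, the following minimal Python sketch (written for this summary, with invented frame names and values rather than the representations used in the tutorial) combines the partial meaning of the spoken phrase "zoom in here" with the partial meaning of a pen gesture on a map by unifying feature structures:

    # Minimal sketch of unification-based multimodal integration.
    # Feature structures are nested Python dicts; the frame names and
    # values below are invented for illustration only.

    FAIL = object()  # sentinel marking unification failure

    def unify(fs1, fs2):
        """Recursively unify two feature structures (nested dicts)."""
        if isinstance(fs1, dict) and isinstance(fs2, dict):
            result = dict(fs1)
            for key, value in fs2.items():
                if key in result:
                    merged = unify(result[key], value)
                    if merged is FAIL:
                        return FAIL
                    result[key] = merged
                else:
                    result[key] = value
            return result
        return fs1 if fs1 == fs2 else FAIL  # atomic values must match

    # Partial meaning from the spoken input "zoom in here":
    speech_fs = {"action": "zoom", "location": {"type": "point"}}

    # Partial meaning from a pen gesture on the map display:
    gesture_fs = {"location": {"type": "point", "coords": (132, 47)}}

    # Integration succeeds because the two structures are compatible:
    print(unify(speech_fs, gesture_fs))
    # {'action': 'zoom', 'location': {'type': 'point', 'coords': (132, 47)}}

If the speech instead required an area while the gesture supplied a point, the two location values would clash and unification would fail, which is how incompatible speech and gesture combinations are ruled out in this toy setting.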

This tutorial is intended for students, researchers, and practitioners in natural language and speech processing who want to see how many of the grammar- and corpus-based techniques developed within the community can be applied to the creation of real-world multimodal interactive systems. It is introductory in nature, and no special knowledge or background is required. The tutorial will also provide an overview of emerging standards that support multimodal interaction and will finish with a presentation of how multimodal integration, dialog management, and generation work together in a sample multimodal application.

TUTORIAL OUTLINE

  1. Introduction
    • Definition and motivation for multimodal user interfaces
    • Examples of multimodal user interfaces: Video demonstrations
    • Language processing architectures for multimodal interaction
  2. Unification-based multimodal integration and parsing
    • Multimodal integration as unification
    • Unification-based multimodal grammars
    • Multidimensional parsing
  3. Finite-state methods for multimodal understanding
    • Representation of input streams
    • Multimodal grammars
    • Implementation using finite-state methods (see the sketch after this outline)
    • Integration of multimodal grammars with recognition
  4. Robust multimodal input processing
    • Robustness in spoken and multimodal language processing
    • Edit machines
    • Multimodal understanding as classification
    • Learning edit machines using machine translation
  5. Multimodal dialog management
    • Representation of multimodal dialog context
    • Clarification in multimodal dialog
    • Mode-independent dialog management
  6. Multimodal output generation
    • Multimodal content planning
    • Media synchronization
    • Generation of non-verbal behaviors
  7. Standards for multimodal interfaces
    • Speech GUI Integration: X+V and SALT
    • EMMA: Extensible MultiModal Annotation
  8. Multimodal applications and challenges
    • Sample prototype multimodal application
    • Incrementality and adaptivity
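
As a pointer to item 3 above, the toy Python sketch below shows one way a finite-state multimodal grammar can be written down: arcs carry (speech word, gesture symbol, meaning symbol) triples, and aligned speech and gesture streams are traversed to read off a meaning tape. The grammar, symbol names, and greedy traversal are all assumptions made for this summary; a real system would build and compose weighted transducers with a finite-state toolkit rather than use this hand-rolled loop.

    # Toy finite-state multimodal grammar (cf. outline item 3).
    # Arcs carry (speech word, gesture symbol, meaning symbol) triples;
    # "eps" is the empty symbol on a tape. Grammar and symbols are invented.

    EPS = "eps"

    # (source state, speech, gesture, meaning, target state)
    ARCS = [
        (0, "phone", EPS, "<cmd:phone>", 1),
        (0, "email", EPS, "<cmd:email>", 1),
        (1, "this", "Gp", EPS, 2),          # "this" aligns with a point gesture Gp
        (2, "restaurant", "rest_id", "<obj:rest_id>", 3),
    ]
    FINAL_STATES = {3}

    def parse(speech, gesture):
        """Greedily traverse the machine over both tapes; return the meaning tape."""
        state, i, j, meaning = 0, 0, 0, []
        progressed = True
        while progressed:
            progressed = False
            for (src, w, g, m, dst) in ARCS:
                if src != state:
                    continue
                w_ok = w == EPS or (i < len(speech) and speech[i] == w)
                g_ok = g == EPS or (j < len(gesture) and gesture[j] == g)
                if w_ok and g_ok:
                    state, i, j = dst, i + (w != EPS), j + (g != EPS)
                    if m != EPS:
                        meaning.append(m)
                    progressed = True
                    break
        if state in FINAL_STATES and i == len(speech) and j == len(gesture):
            return meaning
        return None  # inputs not covered by the grammar

    # Speech "phone this restaurant" with a pen point on a restaurant icon:
    print(parse(["phone", "this", "restaurant"], ["Gp", "rest_id"]))
    # ['<cmd:phone>', '<obj:rest_id>']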

MICHAEL JOHNSTON is a Senior Technical Specialist in the IP and Voice-enabled services research lab of AT&T Labs - Research. His research interests span natural language processing, spoken and multimodal interactive systems, and human-computer interaction. For the last ten years, his work has focused on the extension of language and dialog processing technologies to support multimodal interaction. In 1999, Dr. Johnston was awarded an NSF CAREER award for research on multimodal language processing for natural interfaces. He is also active in the creation of standards supporting spoken and multimodal interface development and serves as editor-in-chief of the World Wide Web Consortium (W3C) EMMA: Extensible MultiModal Annotation specification. Dr. Johnston is a member of the IEEE Speech and Language Technical Committee (2006-2008), was an area chair for ACL 2004, and has served as a program committee member and reviewer for numerous international conferences, journals, and workshops.

SRINIVAS BANGALORE is a Senior Technical Specialist in the IP and Voice-enabled services research lab of AT&T Labs - Research. His research areas include speech and language processing topics related to parsing, machine translation, multimodal integration, and finite-state methods. His dissertation was on a robust parsing approach called Supertagging that combines the strengths of statistical and linguistic models of language processing. During the past ten years, the topics he has worked on include tightly coupling speech recognition and language translation using finite-state speech translation approaches, a supertag-based surface realizer for natural language generation, and finite-state-based multimodal integration and understanding. Dr. Bangalore has served on the editorial board of the journal Computational Linguistics (2001-2003), was the workshop chair for ACL 2004, is a member of the IEEE Speech Technical Committee (2006-2008), and has served as a program committee member for a number of ACL and IEEE conferences and workshops.