Finite-state Language Processing
Shuly Wintner
http://cs.haifa.ac.il/~shuly
Finite-state technology is becoming an
invaluable tool for various levels of language processing. It is the
computational means of choice for describing the phonology, lexicon
and morphology of natural languages, but is increasingly used for
other purposes as well, including (shallow) parsing, word-level
translation, and named entity recognition.
The tutorial will provide an introduction to the
technology and its many applications in natural language
processing. It starts with the very basics of finite-state devices and
regular expressions and concludes with a sketch of how to design and
implement a large-scale project. Several examples of real applications
illustrate the formal material.
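As a taste of the formal material, here is a minimal sketch (in Python; the toy stems and suffixes are invented for illustration) of a deterministic finite-state automaton that stores a small lexicon as a trie:

```python
# A minimal deterministic finite-state automaton (DFA), stored as a
# trie over a toy lexicon. The stems and suffixes are invented for
# illustration: the machine accepts "walk" or "talk" followed by an
# optional "s", "ed", or "ing".

def build_dfa(words):
    trans = {}            # (state, char) -> state
    accepting = set()
    next_state = 1        # state 0 is the start state
    for w in words:
        s = 0
        for ch in w:
            if (s, ch) not in trans:
                trans[(s, ch)] = next_state
                next_state += 1
            s = trans[(s, ch)]
        accepting.add(s)
    return trans, accepting

def accepts(trans, accepting, w):
    s = 0
    for ch in w:
        if (s, ch) not in trans:
            return False  # no transition defined: reject
        s = trans[(s, ch)]
    return s in accepting

stems, suffixes = ("walk", "talk"), ("", "s", "ed", "ing")
trans, accepting = build_dfa(s + suf for s in stems for suf in suffixes)
print(accepts(trans, accepting, "walked"))  # True
print(accepts(trans, accepting, "walken"))  # False
```

Storing the lexicon as a shared-prefix trie is exactly why finite-state devices are so compact for large word lists.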
Tutorial Outline
- Finite-state automata (FSA)
- Regular expressions
- Operations on automata
- Applications of FSA in NLP
- Storing lexicons
- Regular relations
- Finite-state transducers (FSTs)
- Properties of FSTs
- Applications of FSTs in NLP
- Morphological analysis
- Part of speech tagging
- Translation dictionaries
- Extended regular expression languages
- Replace rules and composition
- Applications
- Markup
- Morphological analysis and generation
- Shallow parsing
- Available tools
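To make the transducer topics in the outline concrete, here is a toy sketch (in Python; the analysis tags and the single lexical entry are invented for illustration) of a finite-state transducer mapping a surface form to a morphological analysis:

```python
# A toy finite-state transducer (FST): each transition carries an
# input symbol and an output string, and accepting states carry a
# final output. The entry and tags are invented for illustration:
# "walk" -> "walk+V" and "walks" -> "walk+V+3sg".

trans = {}  # (state, input_char) -> (next_state, output_string)
s = 0
for ch in "walk":
    trans[(s, ch)] = (s + 1, ch)  # copy the stem to the output
    s += 1
trans[(4, "s")] = (5, "")         # consume "s", emit nothing here
finals = {4: "+V", 5: "+V+3sg"}   # tag emitted at an accepting state

def transduce(word):
    s, out = 0, []
    for ch in word:
        if (s, ch) not in trans:
            return None           # no transition: no analysis
        s, o = trans[(s, ch)]
        out.append(o)
    if s not in finals:
        return None
    return "".join(out) + finals[s]

print(transduce("walks"))  # walk+V+3sg
print(transduce("walk"))   # walk+V
```

Run in the other direction, the same machine generates surface forms from analyses, which is the symmetry that makes transducers attractive for morphology.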
Target Audience
This tutorial is designed for computer scientists and
linguists alike. Acquaintance with basic formal language theory and
knowledge of some programming language will be useful, but not
mandatory.
Shuly Wintner is an Assistant Professor in the
Department of Computer Science at the University of Haifa, Israel. His
research involves adaptation of computer science techniques and
paradigms to computational linguistics, with an emphasis on formal
grammars and finite-state devices.
What's New in Statistical Machine Translation
Kevin Knight and Philipp Koehn
http://www.isi.edu/~knight
http://www.isi.edu/~koehn
Accurate translation requires a great deal of knowledge
about the usage and meaning of words, the structure of phrases, the
meaning of sentences, and which real-life situations are
plausible. Recently, there has been a fair amount of research into
extracting translation-relevant knowledge automatically from large
collections of manually-translated texts, and over the past years,
several statistical MT projects have appeared in North America,
Europe, and Asia, and the literature is growing substantially. We will
survey this progress.
Tutorial Outline
- Data for MT.
- Bilingual corpora: what's out there?
- Acquisition and cleaning.
- What does three million words really mean?
- MT Evaluation.
- Manual and automatic.
- Core Models and Decoders
- IBM Models 1-5 and HMM models, training, decoding.
- Word alignment and its evaluation.
- Phrase models.
- Syntax-based translation and language models.
- Specialized Models.
- Named entity MT, numbers and dates, morphology, noun phrase MT.
- Available Resources.
- Tools and data.
- Bibliography
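The core models in the outline can be given a flavor with a minimal sketch of IBM Model 1 EM training (in Python; the three-sentence German-English corpus is invented for illustration, and real systems add alignment, fertility, and distortion models on top of this):

```python
from collections import defaultdict

# Toy German-English parallel corpus, invented for illustration.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# t[(e, f)] approximates the lexical translation probability t(e|f),
# initialized uniformly over the English vocabulary.
t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

for _ in range(20):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in corpus:
        for e in es:
            # E-step: each English word distributes one expected
            # count over the foreign words it might align to.
            norm = sum(t[(e, f)] for f in fs)
            for f in fs:
                c = t[(e, f)] / norm
                count[(e, f)] += c
                total[f] += c
    # M-step: re-estimate t(e|f) from the expected counts.
    for (e, f) in t:
        t[(e, f)] = count[(e, f)] / total[f]

best = max(e_vocab, key=lambda e: t[(e, "haus")])
print(best)  # converges to "house" on this toy corpus
```

Even on three sentences, EM resolves the ambiguity: "das" is pulled toward "the" by co-occurring with it twice, which in turn frees "haus" to align with "house".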
Target Audience
The target audience for this tutorial is anyone
interested in machine translation of human languages.
Kevin Knight is a Senior Research Scientist at the
USC/Information Sciences Institute and a Research Associate Professor
in the Computer Science Department at USC. He has written a number of
articles on statistical MT, plus a widely-circulated MT workbook
(http://www.isi.edu/natural-language/mt/wkbk.rtf). Dr. Knight has
given several invited talks on machine translation at recent AMTA and
EMNLP conferences.
Philipp Koehn completed his Ph.D. in Computer Science at
the University of Southern California in Fall 2003. He has written a
number of articles on topics in statistical machine translation,
including bilingual lexicon induction from monolingual corpora,
word-level translation models, and translation with scarce
resources. He has also worked at AT&T Laboratories on text-to-speech
systems, and at WhizBang! Labs on text categorization.
Semantic Inference for Question Answering
Sanda Harabagiu and Srini Narayanan
http://hlt.udallas.edu/~sanda
http://www.icsi.berkeley.edu/~snarayan
The AQUAINT QA program has provided solid evidence that
potential users of QA systems appear to have limited need for factoid
question answering, but a much greater need for systems that can
reason about causes, effects, chains of hypotheses, and so on --
capabilities that current systems do not adequately
support. Approaching this goal requires combining
sophisticated systems for knowledge representation and inference with
methods to extract such deep semantic relations from linguistic
input. We believe that recent important advances in knowledge
representation and inference, the widespread availability of
semantically motivated resources such as WordNet and FrameNet, and
successful recent efforts at textual analysis, including
predicate-argument extraction, point the way to building the next
generation of semantically rich QA systems. This tutorial surveys
important recent progress on semantically-based QA, articulating
connections among these lines of work and highlighting efforts that
have brought one or more of these techniques to bear on QA system
design and development.
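As a toy illustration of the kind of reasoning involved (the facts and the single transitivity rule are invented for illustration; real systems use far richer representations and inference engines), here is a forward-chaining closure over extracted cause relations:

```python
# Toy forward-chaining inference over semantic relations that might
# be extracted from text. Facts and the rule are invented for
# illustration: "cause" is treated as transitive, so a question like
# "What can an earthquake cause?" reaches indirect effects.

facts = {("cause", "earthquake", "tsunami"),
         ("cause", "tsunami", "flooding")}

def close_causes(facts):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        # Rule: cause(a, b) and cause(b, c) => cause(a, c)
        new = {("cause", a, c)
               for (r1, a, b) in facts if r1 == "cause"
               for (r2, b2, c) in facts if r2 == "cause" and b2 == b}
        if not new <= facts:
            facts |= new
            changed = True
    return facts

closed = close_causes(facts)
print(("cause", "earthquake", "flooding") in closed)  # True
```

Answering "What can an earthquake cause?" then amounts to collecting all `c` with `("cause", "earthquake", c)` in the closure.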
Tutorial Outline
- Methods to extract semantic relations from text.
- Statistical techniques.
- Knowledge intensive techniques.
- Supervised and unsupervised learning techniques.
- Knowledge representation and inference techniques for QA.
- Logical Inference Methods.
- Structured Probabilistic Methods.
- Probabilistic Relational Models for inference with uncertainty.
- Models of Event Structure.
- Ontologies and Linguistic resources for QA.
- Linguistic resources.
- WordNet.
- FrameNet.
- PropBank.
- Ontologies and resources on the Semantic Web.
- OpenCYC.
- OWL.
- OWL-S.
Target Audience
This tutorial is designed for computer scientists and
linguists alike. Acquaintance with statistical techniques and
knowledge representation will be useful, but not mandatory.
Dr. Sanda Harabagiu is an Associate Professor and the
Erik Jonsson School Research Initiation Chair in the Department of
Computer Science at the University of Texas at Dallas. She earned her
Ph.D. in Computer Engineering from the University of Southern California
and a Research Doctorate from Tor Vergata University in Rome, Italy. Her
research interests are in the area of Question Answering, Information
Extraction, Reference Resolution and Text Summarization.
Dr. Srini Narayanan is a Senior Research Scientist at
the International Computer Science Institute (ICSI), Berkeley where he
is a co-PI with the NTL (http://www.icsi.berkeley.edu/NTL)
and FrameNet (http://www.icsi.berkeley.edu/~framenet)
projects. He obtained his PhD in Computer Science from the University
of California, Berkeley in 1998. His research interests include
computational semantics and metaphor, probabilistic dynamic models,
and computational neuroscience.
Graphical Models in Speech and Language Research
Jeff Bilmes
http://ssli.ee.washington.edu/~bilmes
Graphical models (GMs) are a general statistical
abstraction that can be used to describe a wide variety of problem
domains. Recently, significant research has been devoted to their
application to speech and language processing. GMs offer a
mathematically formal but widely flexible means for solving many of
the problems encountered in these fields. Because of their
generality, GMs make it possible to rapidly go from novel idea to
working implementation. In this advanced tutorial, we will survey how
GMs can be used to represent structures and models in speech and
language.
We start with concepts and notation, including an
inspection of different forms of graphical models, and some intriguing
constructs these forms make available. This includes the notion of a
"switching network", where one portion of a network might determine
the existence of another, "sparse dependencies", where many
combinations of variable values are forced to have zero probability,
and "child observations", where influence can flow in the opposite
direction of a directed edge in a graph. We will in general see how
GMs can be viewed as a mathematically formal visual language, offering
a precise set of primitives for specifying statistical systems. We
will continue with an analysis of algorithms for performing
probabilistic inference on graphs, concentrating on both theory (e.g.,
when is inference tractable) and practice (data structures and
implementation). We will give special attention to the challenges
that arise when the underlying domain is temporal.
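As a minimal illustration of probabilistic inference on a graph (the three-node chain and its conditional probability tables are invented for illustration, and brute-force enumeration is tractable only for tiny graphs, which is exactly why the tutorial covers real inference algorithms):

```python
from itertools import product

# Toy Bayesian network A -> B -> C with invented CPTs; inference is
# done by brute-force enumeration of the joint distribution.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.6, False: 0.4},
               False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    # Factorization implied by the graph: P(A) P(B|A) P(C|B).
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Query P(A=True | C=True): sum out B, then normalize over A.
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(a, b, True)
          for a, b in product((True, False), repeat=2))
print(num / den)  # about 0.481
```

Observing C pulls the posterior on A above its 0.3 prior; efficient inference algorithms recover the same answer without enumerating every assignment.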
Next, we will examine the ways GMs can represent speech
and language. This will include explicit representations of
hierarchical and temporal phenomena such as parameter sharing,
multi-stream models with varying degrees of asynchrony, and classifier
combination. We will see how these can be used to represent speech
evolution in terms of both phonology and articulation. We will also
cover graphical representations of language, including explicit
structures for N-grams, interpolation, skipping, hierarchical classes,
smoothing, back-off, factored representations, and other
forms. Furthermore, we will investigate how to describe statistical
machine translation via novel multi-dynamic graph representations.
Graphs can not only represent many well-known
statistical models; with only minor adjustments they can also
represent very different (and potentially novel) systems. We will
observe how deterministic dependencies, switching networks, and child
observations greatly facilitate this phenomenon. Moreover, we will see
how a graph's associated inferential machinery can shield a user from
needing to "reinvent the wheel" each time a new model is
investigated.
Lastly, we will briefly survey available GM toolkits and
their features. We will include a comparison of GM technology with its
modern alternatives. Tutorial attendees will thus learn not only how
to use GMs, but also how to decide when and where GM technology is
best applied.
Tutorial Outline
- Overview and Motivation.
- Different GM types, constructs, and structures.
- Theory and practice of probabilistic inference in Dynamic GMs.
- Explicit representations of temporal structures.
- Graphical models of speech.
- Graphical models of language.
- Graphical models of statistical machine translation.
- GM Toolkits.
- GM technology vs. its alternatives.
Target Audience
This tutorial will assume a basic knowledge of standard
language and speech processing, including knowledge of hidden Markov
models, maximum entropy models, and the many techniques that go into
making such models successful. It will also be assumed that the
audience is comfortable with basic statistical terminology.
Jeff A. Bilmes is an Assistant Professor in the
Department of Electrical Engineering at the University of Washington,
Seattle (adjunct in Linguistics and in Computer Science and
Engineering). He co-founded the Signal, Speech, and Language
Interpretation Laboratory at the University. He received a master's
degree from MIT and a Ph.D. in Computer Science from the University of
California, Berkeley. Jeff is an author of the graphical models
toolkit (GMTK), and was a leader of the 2001 Johns Hopkins summer
workshop team applying graphical models to speech and language. His
primary research lies in statistical graphical models, speech,
language and time series processing, human-computer interfaces, and
probabilistic machine learning.
Large Scale Spoken Document Retrieval
Pedro J. Moreno and Jean Manuel Van Thong
http://www.hpl.hp.com/personal/Pedro_Moreno
http://www.hpl.hp.com/personal/Jean-Manuel_Van_Thong
Search engines like Google or Yahoo have been extremely
successful over the years in facilitating the search and retrieval of
text pages and written documents. However, only recently have these
technologies been extended to spoken documents. While spoken document
retrieval shares many similarities with standard text search, it also
differs in important ways.
In this tutorial we provide an introduction to the field
of spoken document retrieval, with a special emphasis on large audio
collections. We will start with a general introduction to speech
recognition, continue with various approaches to audio indexing, and
then describe the overall architecture needed for large-scale
indexing. We will conclude with several demos of existing engines and
technologies.
Tutorial Outline
- Extracting metadata from raw audio.
- Fundamentals of speech recognition.
- Acoustic modeling.
- Word-based, phone-based.
- Language modeling.
- Search.
- Speech recognition approaches for audio indexing.
- Phonetic search.
- Word spotting approaches.
- Large vocabulary speech recognition.
- Syllable based speech recognition.
- Limitations and advantages of all approaches.
- The out-of-vocabulary (OOV) problem.
- Text audio alignment.
- Indexing and searching metadata.
- Searching versus indexing.
- Content segmentation.
- Modification to text indexing, long documents vs. short documents.
- Index fusion approaches.
- Acoustic search versus semantic search.
- Architecture design for large scale indexing.
- The web search model for audio indexing.
- Audio (and video) crawling.
- Audio to text transcription.
- Index construction.
- API's for querying and index update.
- The user interface design.
- Putting everything together.
- Demos of several systems.
- Conclusions: Where is audio indexing headed?
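The indexing step in the outline can be sketched minimally as follows (in Python; the transcripts and time stamps are invented for illustration): an inverted index over time-stamped recognizer output, so a query can jump playback to the right offset.

```python
from collections import defaultdict

# Minimal inverted index over time-stamped transcripts. The data is
# invented for illustration: each document is a list of
# (start_time_in_seconds, word) pairs, as an ASR engine might emit.
docs = {
    "show1": [(0.0, "welcome"), (0.4, "to"), (0.6, "the"),
              (0.9, "news")],
    "show2": [(0.0, "news"), (0.5, "about"), (0.9, "the"),
              (1.1, "weather")],
}

index = defaultdict(list)
for doc_id, words in docs.items():
    for start, w in words:
        index[w].append((doc_id, start))

def search(word):
    # Return (document, start time) hits so playback can seek there.
    return index.get(word, [])

print(search("news"))  # [('show1', 0.9), ('show2', 0.0)]
```

Unlike a text index, each posting keeps a time offset rather than just a position, which is what lets the interface play the matching audio segment.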
Target Audience
This tutorial is designed for information retrieval researchers and
computer scientists with no previous knowledge of speech recognition
or information retrieval.
Pedro J. Moreno is a senior researcher at the Cambridge
Research Lab, which is part of Hewlett-Packard Labs. His main
interests are in the practical applications of machine learning
techniques in several fields such as audio indexing, image
retrieval, text classification and noise robustness. Dr. Moreno has
been involved in the design of HP Labs' audio indexing engine
SpeechBot. Lately his main interests are in the areas of
bioinformatics and bio signal interpretation.
JM Van Thong is a senior researcher at the Cambridge
Research Lab, which is part of Hewlett-Packard Labs. His current
research interests are bioinformatics, media indexing, and information
retrieval systems as well as user interfaces. During his 17 years
spent in research, JM has been involved in several successful projects
including SpeechBot, the first large scale web audio indexing system,
RedBot, a web-based tool for automatic red-eye correction, an
information retrieval system for hand-helds, a real-time streaming
phoneme recognizer for a facial animation package, and planar-map
technology for sketching software.
Statistical Language Models and Information Retrieval
ChengXiang Zhai
http://www-faculty.cs.uiuc.edu/~czhai
Statistical language models play an important role in
virtually all kinds of tasks involving human language technologies.
In particular, they have been attracting much attention recently in
the information retrieval community due to their theoretical and
empirical advantages over traditional retrieval methods. A great deal
of recent work has shown that statistical language models not only
lead to superior empirical performance, but also facilitate parameter
tuning, open up possibilities for modeling non-traditional retrieval
problems, and in general provide a more principled way of modeling
retrieval problems.
The purpose of this tutorial is to systematically review
the recent progress in applying statistical language models to
information retrieval with an emphasis on the underlying principles
and framework, empirically effective language models, and language
models developed for non-traditional retrieval tasks. Tutorial
attendees can expect to learn the major principles and methods of
applying statistical language models to information retrieval, the
outstanding problems in this area, as well as obtain comprehensive
pointers to the research literature.
Tutorial Outline
- Introduction
- Information Retrieval (IR)
- Statistical Language Models (SLMs)
- Applications of SLMs to IR
- The Basic Language Modeling Approach
- Query likelihood methods and their justification
- Smoothing of language models
- Improving the basic language modeling approach
- Feedback Language Models
- Different ways of feedback with language models
- Representative feedback models (relevance/query models, translation models)
- Language Models for different retrieval tasks
- Cross-language retrieval
- Distributed information retrieval
- TDT and information filtering
- Semi-structured information retrieval
- Subtopic retrieval
- A General Framework for Applying SLMs to IR
- Summary
- SLMs vs. traditional methods: Pros & Cons
- Progress so far
- Challenges and future research directions
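The query-likelihood method with smoothing, central to the outline above, can be sketched as follows (the toy collection and the interpolation weight are invented for illustration; Jelinek-Mercer smoothing is one of several schemes the tutorial covers):

```python
import math
from collections import Counter

# Query-likelihood retrieval with Jelinek-Mercer smoothing over a
# toy collection (documents and lambda invented for illustration).
docs = {
    "d1": "statistical language models for retrieval".split(),
    "d2": "language models of speech recognition".split(),
    "d3": "neural networks for vision".split(),
}
collection = [w for d in docs.values() for w in d]
coll_tf = Counter(collection)
lam = 0.5  # weight on the document model vs. the collection model

def score(query, doc):
    # log P(query | doc model), each term smoothed by interpolating
    # the document language model with the collection model.
    tf = Counter(doc)
    s = 0.0
    for w in query:
        p_doc = tf[w] / len(doc)
        p_coll = coll_tf[w] / len(collection)
        s += math.log(lam * p_doc + (1 - lam) * p_coll)
    return s

query = "language retrieval".split()
ranked = sorted(docs, key=lambda d: score(query, docs[d]),
                reverse=True)
print(ranked)  # 'd1' ranks first: it covers both query terms
```

Smoothing does double duty here: it avoids zero probabilities for unseen terms and plays the role that inverse document frequency plays in traditional retrieval models.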
Target Audience
The tutorial should appeal to both people working on
information retrieval with an interest in applying more advanced
language models and those who have a background in statistical
language models and wish to apply them to information
retrieval. Attendees will be assumed to know basic probability and
statistics.
ChengXiang Zhai is an Assistant Professor of Computer
Science at the University of Illinois at Urbana-Champaign. He
received a Ph.D. in Computer Science from Nanjing University in 1990,
and a Ph.D. in Language and Information Technologies from Carnegie
Mellon University in 2002. He worked at Clairvoyance Corp. as a
Research Scientist and, later, a Senior Research Scientist from 1997
to 2000. His research interests broadly include information retrieval,
natural language processing, machine learning, and bioinformatics. His
most recent work, including his dissertation, is centered on
developing formal retrieval frameworks and applying statistical
language models to text retrieval, especially in directions such as
personalized search and semi-structured information retrieval. He has
served on the program committees of ACM SIGIR 2003, ACM SIGIR 2004,
ACL 2003, and ACM CIKM 2003. He is the IR program co-chair for ACM CIKM
2004. He is a recipient of the 2004 NSF CAREER award.