Tutorials
ACL-IJCNLP 2009 Tutorials at a Glance, 2 August 2009 (Sunday), Suntec Level 2

Time        | MR202 | MR208 | MR209
08:30-10:00 | T1    | T2    | T3
10:00-10:30 | Coffee/Tea Break
10:30-12:00 | T1    | T2    | T3
12:00-14:00 | Lunch
14:00-15:30 | T4    | T5    | T6
15:30-16:00 | Coffee/Tea Break
16:00-17:30 | T4    | T5    | T6
T1: Fundamentals of Chinese Language Processing
Date/Time: Morning, 2 Aug 2009
Presenters/organisers: Chu-Ren Huang and Qin Lu
Abstract
This tutorial introduces the fundamentals of Chinese language processing for text processing. Computer processing of Chinese text requires an understanding of both the language itself and the technology for handling it. The tutorial contains two parts. The first part overviews the grammar of the Chinese language from a language processing perspective, based on naturally occurring data. Real examples of actual language use are presented in a data-driven, corpus-based approach, so that the links to computational approaches for processing Chinese follow naturally. A number of important Chinese NLP resources are also presented. The second part overviews Chinese-specific processing issues and the corresponding computational technologies. It focuses on Chinese word segmentation, with a brief introduction to part-of-speech tagging and some Chinese NLP applications. Word segmentation must deal with problems unique to Chinese, such as unknown word detection and named entity recognition, which are the emphasis of this tutorial.
This tutorial is targeted at both Chinese linguists who are interested in computational linguistics and computer scientists who are interested in research on processing Chinese. More specifically, the expected audience comes from three groups: (1) the linguistic community: linguists or language scientists whose typological, comparative, or theoretical research requires an understanding of Chinese grammar and the processing of Chinese text through observation of corpus data; it is also helpful for Chinese linguists, from graduate students to experts, who may know the language well but wish to learn methods for processing Chinese text data computationally; (2) researchers and students in computer science who are interested in research and development in language technology for Chinese; and (3) scholars in neighbouring fields who work on Chinese, such as communication, language learning and teaching technology, psychology, and sociology, who need a description of basic linguistic facts and resources for basic data.
Some basic knowledge of Chinese would be helpful. Comprehensive understanding of the language is not necessary.
Outline
- Part 1: Highlights of Chinese Grammar for NLP
- 1.1 Preliminaries: Orthography and Writing Conventions
- 1.2 Basic Units of Processing: Words or Characters?
- a. Word-forms vs. Character forms
- b. Word-senses vs. Character-senses
- 1.3 Part-of-Speech: important issues in defining word classes
- 1.4 Word Formation: From affixation to compounding
- 1.5 Unique Constructions and Challenges
- a. Classifier-noun agreement
- b. Separable Compounds (or Ionization)
- c. ‘Verbless’ Constructions
- 1.6. Chinese NLP resources
- Part 2: Text Processing
- 2.1 Lexical processing
- a. Segmentation
- b. Disambiguation
- c. Unknown word detection
- d. Named Entity Recognition
- 2.2 Syntactic processing
- a. Issues in PoS tagging
- b. Hidden Markov Models
- 2.3 NLP Applications
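To make the segmentation topic above concrete, the sketch below shows greedy forward maximum matching against a toy dictionary. It is an illustrative example only (the dictionary and example sentence are invented), not material from the tutorial itself; real systems combine such lookup with statistical disambiguation and unknown-word handling.

```python
# Illustrative sketch only: greedy forward maximum matching segmentation
# over a toy dictionary. Real systems add statistical disambiguation and
# unknown-word detection on top of dictionary lookup.

def fmm_segment(text, dictionary, max_len=4):
    """Segment `text` greedily, always taking the longest dictionary match."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

# Toy dictionary and sentence (invented for illustration).
toy_dict = {"研究", "研究生", "生命", "命", "起源"}
print(fmm_segment("研究生命起源", toy_dict))
# Greedy matching yields ['研究生', '命', '起源'], illustrating why
# dictionary matching alone still needs disambiguation.
```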
Biographical information of presenters
Professor Chu-Ren Huang
Dean of Humanities, The Hong Kong Polytechnic University
Research Fellow, Academia Sinica
Churen.huang@inet.polyu.edu.hk
Prof. Huang is the Dean of the Faculty of Humanities and Chair Professor of Applied Chinese Language Studies at the Hong Kong Polytechnic University and a research fellow at the Institute of Linguistics, Academia Sinica. He received his PhD in linguistics from Cornell University in 1987 and has since played a central role in developing Chinese language resources and in leading the fields of Chinese corpus and computational linguistics. In chronological order, he directed or co-directed the construction of the CKIP lexicon and ICG Grammar (80k entries, including 40k verbs with detailed PAS annotation), Sinica Corpus (10 million words, the first tagged balanced corpus of Chinese), Sinica Treebank, Academia Sinica Bilingual Ontological WordNet (a direct mapping from Princeton WordNet to Chinese with translation), Chinese WordSketch, Tagged Chinese Gigaword Corpus (tagging of the raw corpus from LDC consisting of 1,400 million characters from China, Taiwan, and Singapore), Hantology (an ontology based on the semantic classification of Chinese characters) and Chinese WordNet (over 20,000 senses currently). He has ensured that most of the above resources are available online and has spearheaded the efforts to create language learning sites, such as Adventures in Wen-Land and SouWenJieZi, to integrate these resources, as well as collaborated to develop the versatile tool of Chinese WordSketch. He also pioneered the use of corpora, especially automatically extracted collocations, in creating Chinese dictionaries, starting with the Mandarin Daily Classifier Dictionary in 1996. He has published over 70 journal and book articles and over 280 conference papers on different aspects of Chinese linguistics. He has also edited over 14 books or journal special issues, including the just-completed volume entitled Ontology and the Lexicon, to be published by Cambridge University Press in the Cambridge Studies in Natural Language Processing Series.
Prof. Qin Lu,
Department of Computing,
The Hong Kong Polytechnic University
Hung Hom, Hong Kong
Tel. (852) 2766 7247
Fax (852) 2774 0842
csluqin@comp.polyu.edu.hk
Prof. Qin Lu is a professor in the Department of Computing, The Hong Kong Polytechnic University. Prof. Lu received her BS degree in Electrical Engineering from Beijing Normal University. She then studied at the University of Illinois at Urbana-Champaign, where she received both her M.S. and Ph.D. in computer science. She started working in the fields of open systems and natural language processing in 1992, and has since worked in Chinese-processing-related areas covering open systems, system software development, information retrieval, and standardization. Prof. Lu initiated the I-Hanzix open system architecture, which uses the I18N/L10N concepts to support multiple locales and codeset announcement at the system level. She received an Industrial Support Fund grant of 3.67 million to develop a Chinese Information Server and server access software that make information access over the Internet codeset-transparent, regardless of the Chinese encoding of the documents being accessed. Her work in NLP is mostly focused on using computational methods for information extraction and text mining. She has conducted research in Chinese segmentation and PoS tagging using statistical methods. She was the first to conduct extensive work on Chinese collocation extraction, exploring methods whose results have produced useful resources such as a Chinese collocation bank and a shallow Chinese treebank. In recent years, Prof. Lu has worked extensively in Chinese terminology extraction and ontology construction.
T2: Topics in Statistical Machine Translation
Date/Time: Morning, 2 Aug 2009
Presenters/organisers: Kevin Knight, Philipp Koehn
Abstract
In the past, we presented tutorials called "Introduction to Statistical Machine Translation", aimed at people who know little or nothing about the field and want to get acquainted with the basic concepts. This tutorial, by contrast, goes more deeply into selected topics of intense current interest. We envision two types of participants:
1) People who understand the basic idea of statistical machine translation and want to get a survey of hot-topic current research, in terms that they can understand.
2) People associated with statistical machine translation work, who have not had time to study the most current topics in depth.
We fill the gap between the introductory tutorials that have gone before and the detailed scientific papers presented at ACL sessions.
Outline
Below is our tutorial structure. We showcase the intuitions behind the algorithms and give examples of how they work on sample data. Our selection of topics focuses on techniques that deliver proven gains in translation accuracy, and we supply empirical results from the literature.
- QUICK REVIEW (15 minutes)
- - Phrase-based and syntax-based MT.
- ALGORITHMS (45 minutes)
- - Efficient decoding for phrase-based and syntax-based MT (cube pruning, forward/outside costs).
- - Minimum Bayes-risk decoding.
- - System combination.
- SCALING TO LARGE DATA (30 minutes)
- - Phrase table pruning, storage, suffix arrays.
- - Large language models (distributed LMs, noisy LMs).
- NEW MODELS (1 hour and 10 minutes)
- - New methods for word alignment (beyond GIZA++).
- - Factored models.
- - Maximum entropy models for rule selection and re-ordering.
- - Acquisition of syntactic translation rules.
- - Syntax-based language models and target-language dependencies.
- - Lattices for encoding source-language uncertainties.
- LEARNING TECHNIQUES (20 minutes)
- - Discriminative training (perceptron, MIRA).
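As a minimal illustration of the minimum Bayes-risk item above (an invented sketch, not the presenters' material), the following selects from an n-best list the hypothesis with the highest expected gain against the rest of the list, using a crude unigram-overlap score in place of BLEU; the hypotheses and posterior probabilities are made up.

```python
# Illustrative sketch: minimum Bayes-risk selection over an n-best list.
# The "gain" between two hypotheses is a crude unigram-overlap score,
# standing in for BLEU; hypotheses and posteriors are invented.
from collections import Counter

def gain(hyp, ref):
    """Unigram overlap between two tokenised hypotheses (BLEU stand-in)."""
    overlap = Counter(hyp) & Counter(ref)
    return sum(overlap.values()) / max(len(hyp), 1)

def mbr_select(nbest):
    """nbest: list of (tokens, posterior). Return the hypothesis that
    maximises expected gain against the whole list."""
    best, best_score = None, float("-inf")
    for hyp, _ in nbest:
        expected = sum(p * gain(hyp, other) for other, p in nbest)
        if expected > best_score:
            best, best_score = hyp, expected
    return best

nbest = [("the cat sat on the mat".split(), 0.5),
         ("a cat sat on the mat".split(), 0.3),
         ("the cat is on a mat".split(), 0.2)]
print(" ".join(mbr_select(nbest)))
```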
Biographical information of presenters
Kevin Knight
Address: 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
Email: knight@isi.edu
Dr. Knight’s research interests include machine translation, automata theory, and decipherment. He has authored numerous scientific papers on statistical machine translation, and he is active in building and deploying large-scale MT systems. Previously, he served on the editorial boards of the Computational Linguistics journal, the Journal of Artificial Intelligence Research, and the ACM Transactions on Speech and Language Processing. Dr. Knight is Chief Scientist at Language Weaver, Inc., and he was General Chair of the 2005 ACL conference in Ann Arbor, Michigan.
Philipp Koehn
Address: 10 Crichton Road, Edinburgh, EH8-9AB
Email: pkoehn@inf.ed.ac.uk
Dr. Koehn's research interests include machine translation and its applications, as well as large-scale natural language learning. He has been a lecturer at the University of Edinburgh since 2005, after spending a year as a post-doc at MIT and receiving his PhD from the University of Southern California in 2003. He has served as area chair for machine translation at major conferences (ACL, NAACL, MT Summit) and is known for his efforts to foster open source resources for machine translation, such as the Moses decoder and the Europarl corpus.
T3: Semantic Role Labeling: Past, Present and Future
Date/Time: Morning, 2 Aug 2009
Presenter/organiser: Lluís Màrquez
Abstract
Semantic Role Labeling (SRL) consists of detecting basic event
structures such as "who" did "what" to "whom", "when" and "where". The
identification of such event frames holds potential for significant
impact in many NLP applications, such as Information Extraction,
Question Answering, Summarization and Machine Translation among
others. The work on SRL has included a broad spectrum of supervised
probabilistic and machine learning approaches, presenting significant
advances in many directions over the last several years. However,
despite all the efforts and the considerable degree of maturity of the
SRL technology, the use of SRL systems in real-world applications has
so far been limited and, certainly, below the initial expectations.
This is due to the weaknesses and limitations of current
systems, which have been highlighted by many of the evaluation
exercises and have remained unresolved for several years.
This tutorial has two distinct parts. In the first, the
state-of-the-art on SRL will be overviewed, including: main techniques
applied, existing systems, and lessons learned from the evaluation
exercises. This part will include a critical review of current
problems and the identification of the main challenges for the
future. The second part is devoted to lines of research aimed at
overcoming current limitations. This part will include an analysis of
the relation between syntax and SRL, the development of joint systems
for integrated syntactic-semantic analysis, generalization across
corpora, and engineering of truly semantic features.
Outline
- 1. Introduction
- * Problem definition and properties
- * Importance of SRL
- * Main computational resources and systems available
- 2. State-of-the-art SRL systems
- * Architecture
- * Training of different components
- * Feature engineering
- 3. Empirical evaluation of SRL systems
- * Evaluation exercises at SemEval and CoNLL conferences
- * Main lessons learned
- 4. Current problems and challenges
- 5. Keys for future progress
- * Relation to syntax: joint learning of syntactic and semantic dependencies
- * Generalization across domains and text genres
- * Use of semantic knowledge
- * SRL systems in applications
- 6. Conclusions
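Relating to the evaluation item above, here is a small, purely illustrative sketch (not part of the tutorial materials) of the labeled precision/recall/F1 computation typically reported for SRL systems; the gold and predicted argument annotations are invented.

```python
# Illustrative sketch: labeled precision/recall/F1 over argument spans,
# the usual headline numbers in SRL evaluations. Annotations are invented.

def prf(gold, predicted):
    """gold, predicted: sets of (predicate, start, end, role) tuples."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# "The company bought the factory last year."
gold = {("bought", 0, 1, "A0"), ("bought", 3, 4, "A1"),
        ("bought", 5, 6, "AM-TMP")}
predicted = {("bought", 0, 1, "A0"), ("bought", 3, 4, "A1"),
             ("bought", 5, 6, "A2")}  # wrong label on the temporal argument
print("P=%.2f R=%.2f F1=%.2f" % prf(gold, predicted))
# -> P=0.67 R=0.67 F1=0.67
```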
Biographical information of presenter
Lluís Màrquez
TALP Research Center
Software Department
Technical University of Catalonia
e-mail: lluism@lsi.upc.edu
URL: http://www.lsi.upc.edu/~lluism
He received his Ph.D. in Computer Science from the Technical University of Catalonia
(UPC, 1999). Currently, he is an Associate Professor in the Software
Department (LSI, UPC) lecturing at the Computer Science Faculty of
Barcelona. He is also a senior researcher at the center for research
in Speech and Language Technologies and Applications (TALP). His
current research interests are focused on Machine Learning
architectures for Natural Language structured problems, including
parsing, semantic role labeling, named entity extraction, and word
sense disambiguation. Regarding applications, he is working on the
introduction of high-level linguistic information to Statistical
Machine Translation and Oral Question Answering. He has published over
75 refereed papers on the previous topics in journals and conferences
of NLP and Machine Learning areas. He was program chair of CoNLL-2006
and organizer of the SemEval-2007 semantic evaluation competition and
workshop. He also organized the shared tasks on syntactic and semantic
parsing at CoNLL-2004, 2005, 2008 and 2009, and led the teams that
prepared three evaluation tasks at Senseval-3 and SemEval-2007. He has
been guest editor of the special issues "Semantic Role Labeling" and "Computational Semantic Analysis of Language" at Computational
Linguistics and Language Resources and Evaluation, respectively.
Currently, he acts as president of the ACL SIG on Natural Language
Learning (SIGNLL) and chairs the 13th Annual Conference of the
European Association for Machine Translation (EAMT-2009), and the
SEW-2009 NAACL-HLT workshop, "Semantic Evaluations: Recent
Achievements and Future Directions".
T4: Computational Modeling of Human Language Acquisition
Date/Time: Afternoon, 2 Aug 2009
Presenter/organiser: Afra Alishahi
Abstract
The nature and amount of information needed for learning a natural language, and the underlying mechanisms involved in this process, are the subject of much debate: is it possible to learn a language from usage data only, or is some sort of innate knowledge and/or bias needed to boost the process? This is a topic of interest to (psycho)linguists who study human language acquisition, as well as computational linguists who develop the knowledge sources necessary for large-scale natural language processing systems. Children are a source of inspiration for any such study of language learnability. They learn language with ease, and their acquired knowledge of language is flexible and robust.
Human language acquisition has been studied for centuries, but using computational modeling for such studies is a relatively recent trend. However, computational approaches to language learning have become increasingly popular, mainly due to advances in machine learning techniques and the availability of vast collections of experimental data on child language learning and child-adult interaction. Many of the existing computational models attempt to study the complex task of learning a language under cognitive plausibility criteria (such as the memory and processing limitations that humans face), as well as to explain the developmental patterns observed in children. Such computational studies can provide insight into the plausible mechanisms involved in human language acquisition, and be a source of inspiration for developing better language models and techniques.
This tutorial will review the main research questions that researchers in the field of computational language acquisition are concerned with, as well as the common approaches and techniques used in developing these models. Computational modeling has been applied to many domains of language acquisition, including word segmentation and phonology, morphology, syntax, semantics and discourse. However, due to time restrictions, the focus of the tutorial will be on the acquisition of word meaning, syntax, and the link between syntax and semantics.
Outline
- * Computational Psycholinguistics and NLP
- - Overview: computational modeling and experimental observations
- - Evaluation of computational models
- * Human Language Acquisition
- - Modularity
- - Learnability and innateness
- - Available collections of child experimental data
- * Computational Studies of Learning Word Meaning
- - Word learning as constraint satisfaction
- - Probabilistic simulation of developmental patterns
- * Computational Models of Syntax Acquisition
- - Symbolic accounts of grammar acquisition
- - Connectionist models of learning linguistic structures
- - Probabilistic grammar induction from text data
- * Relation between Syntax and Semantics
- - Verb argument structure and linking rules
- - Connectionist models of linking syntax and semantics
- - Computational construction-based approaches
- - Bayesian models of argument structure acquisition
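As a toy illustration of the word-learning topic above (an invented sketch, not the presenter's model), the following accumulates cross-situational co-occurrence counts between words and candidate meanings and normalises them into word-meaning association scores.

```python
# Illustrative sketch: cross-situational word learning by co-occurrence
# counting. Each "situation" pairs an utterance with a set of candidate
# meanings; repeated exposure gradually disambiguates word-meaning
# mappings. The utterances and meaning symbols are invented.
from collections import defaultdict

def learn(situations):
    counts = defaultdict(lambda: defaultdict(float))
    for words, meanings in situations:
        for w in words:
            for m in meanings:
                counts[w][m] += 1.0
    # Normalise counts into association scores per word.
    assoc = {}
    for w, ms in counts.items():
        total = sum(ms.values())
        assoc[w] = {m: c / total for m, c in ms.items()}
    return assoc

situations = [
    (["the", "dog", "barks"], {"DOG", "BARK"}),
    (["the", "dog", "runs"], {"DOG", "RUN"}),
    (["the", "cat", "runs"], {"CAT", "RUN"}),
]
assoc = learn(situations)
print(max(assoc["dog"], key=assoc["dog"].get))  # -> DOG
```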
Biographical information of presenter
Afra Alishahi (afra@coli.uni-sb.de)
Computational Psycholinguistics Group,
Department of Computational
Linguistics and Phonetics,
Saarland University, Germany
Afra Alishahi received her PhD from the Computer
Science Department, University of Toronto, Canada where she was a
member of the Computational Linguistics group. She is now a
postdoctoral fellow at the Computational Psycholinguistics group in
the Department of Computational Linguistics and Phonetics, Saarland
University. She has been working on the probabilistic modeling of
various aspects of child language acquisition, including verb argument
structure, word meaning, verb semantic roles and selectional
preferences.
T5: Learning to Rank
Date/Time: Afternoon, 2 Aug 2009
Presenter/organiser: Hang Li
Short Abstract
In this tutorial I will introduce "learning to rank", a machine learning technology for constructing a model that ranks objects using training data. I will first explain the problem formulation of learning to rank and its relation to other learning tasks. I will then describe learning to rank methods developed in recent years, including pointwise, pairwise, and listwise approaches. I will then give an introduction to the theoretical work on learning to rank and to its applications. Finally, I will outline some future directions of research on learning to rank. The goal of this tutorial is to give the audience a comprehensive survey of the technology and to stimulate more research on it and on its application to natural language processing.
Learning to rank has been successfully applied to information retrieval and is potentially useful for natural language processing as well. In fact, many NLP tasks can be formalized as ranking problems, and NLP technologies may be significantly improved by using learning to rank techniques. These include question answering, summarization, and machine translation. For example, in machine translation, given a sentence in the source language, we are to translate it into a sentence in the target language. Usually there are multiple possible translations, and it would be better to sort them in descending order of their likelihood and output the sorted results. Learning to rank can be employed in this task.
Outline
- 1. Introduction
- 2. Learning to Rank Problem
- a) Problem Formulation
- b) Evaluation
- 3. Learning to Rank Methods
- a) Pointwise Approach
- b) Pairwise Approach
- i. Ranking SVM
- ii. RankBoost
- iii. RankNet
- iv. IR SVM
- c) Listwise Approach:
- i. ListNet
- ii. ListMLE
- iii. AdaRank
- iv. SVM Map
- v. PermuRank
- vi. SoftRank
- d) Other Methods
- 4. Learning to Rank Theory
- a) Pairwise Approach
- i. Generalization Analysis
- b) Listwise Approach
- i. Generalization Analysis
- ii. Consistency Analysis
- 5. Learning to Rank Applications
- a) Search Ranking
- b) Collaborative Filtering
- c) Key Phrase Extraction
- d) Potential Applications in Natural Language Processing
- 6. Future Directions for Learning to Rank Research
- 7. Conclusion
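As a minimal illustration of the pairwise approach listed above (an invented sketch, not the presenter's material), the following trains a linear scoring function with a pairwise perceptron: whenever a more relevant item scores no higher than a less relevant one from the same query, the weights are nudged toward the more relevant item's features.

```python
# Illustrative sketch: a pairwise perceptron for learning to rank.
# Each query has items as (feature_vector, relevance_label); data invented.

def train_pairwise(queries, dim, epochs=20, lr=0.1):
    """Learn linear weights so that higher-relevance items score higher."""
    w = [0.0] * dim
    for _ in range(epochs):
        for items in queries:
            for xi, yi in items:
                for xj, yj in items:
                    if yi > yj:  # xi should outrank xj
                        si = sum(a * b for a, b in zip(w, xi))
                        sj = sum(a * b for a, b in zip(w, xj))
                        if si <= sj:  # mis-ordered pair: update weights
                            w = [wk + lr * (a - b)
                                 for wk, a, b in zip(w, xi, xj)]
    return w

# Two toy queries, 2-dimensional features; higher label = more relevant.
queries = [
    [([1.0, 0.2], 2), ([0.3, 0.9], 1), ([0.1, 0.1], 0)],
    [([0.9, 0.1], 1), ([0.2, 0.8], 0)],
]
w = train_pairwise(queries, dim=2)
score = lambda x: sum(a * b for a, b in zip(w, x))
print([label for _, label in sorted(queries[0], key=lambda it: -score(it[0]))])
# -> [2, 1, 0]: the learned weights reproduce the gold ordering
```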
Biographical information of presenter
Hang Li
Microsoft Research Asia
Email: hangli@microsoft.com
Homepage: http://research.microsoft.com/en-us/people/hangli/
Hang Li is a senior researcher and research manager in the Information Retrieval and Mining Group at Microsoft Research Asia. He is also an adjunct professor at Peking University, Nanjing University, Xi'an Jiaotong University, and Nankai University. His research areas include natural language processing, information retrieval, statistical machine learning, and data mining. He graduated from Kyoto University and earned his PhD from the University of Tokyo. Hang Li has been working on learning to rank and its applications. He has 15 publications on the topic at SIGIR, ICML, and other top conferences. He co-organized two workshops on learning to rank at SIGIR’07 and SIGIR’08. The group he leads is regarded as one of the most active research groups in this area.
T6: State-of-the-art NLP approaches to coreference resolution: theory and practical recipes.
Date/Time: Afternoon, 2 Aug 2009
Presenter/organisers: Simone Paolo Ponzetto, Massimo Poesio
Abstract
Identifying the nominal phrases in a discourse that are used to refer
to the same (discourse) entity is essential for achieving robust
natural language understanding (NLU). The importance of this task is
amplified as the field of Natural Language Processing (NLP) moves
towards high-level linguistic tasks requiring NLU capabilities, such
as recognizing textual entailment. This
tutorial aims at providing the NLP community with a gentle
introduction to the task of coreference resolution from both a
theoretical and an application-oriented perspective. Its main purposes
are: (1) to introduce a general audience of NLP researchers to the
core ideas underlying state-of-the-art computational models of
coreference; (2) to provide that same audience with an overview of NLP
applications which can benefit from coreference information.
Outline
The tutorial is divided into three main parts:
1. Introduction to machine learning approaches to coreference
resolution.
We start by focusing on the machine-learning-based approaches
developed in the seminal works of Soon et al. (2001) and Ng & Cardie
(2002). We then analyze the main limitations of these approaches,
namely that they cluster mentions based on a local pairwise
classification of nominal phrases in text (a minimal sketch of this
pairwise set-up is given after the outline). We finally move on to
present more complex models which attempt to model coreference as a
global discourse phenomenon (Yang et al., 2003; Luo et al., 2004;
Daume & Marcu, 2005; inter alia).
2. Lexical and encyclopedic knowledge for coreference resolution.
Resolving anaphors to their correct antecedents requires
in many cases lexical and encyclopedic knowledge. We accordingly
introduce approaches which attempt to include semantic information
into the coreference models from a variety of knowledge sources,
e.g. WordNet (Harabagiu et al., 2001), Wikipedia (Ponzetto & Strube,
2006) and automatically harvested patterns (Poesio et al., 2002;
Markert & Nissim, 2005; Yang & Su, 2007).
3. Applications and future directions.
We present an overview of NLP applications which have been shown to
profit from coreference information, e.g. question answering and
summarization. We conclude with remarks on future work
directions. These include: a) bringing together approaches to
coreference using semantic information with global discourse
modeling techniques; b) exploring novel application scenarios which
could potentially benefit from coreference resolution, e.g. relation
extraction and extracting events and event chains from text.
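As referenced in part 1 of the outline, here is a small, purely illustrative sketch of Soon et al. (2001)-style training instance creation for a mention-pair model: each anaphoric mention forms a positive pair with its closest antecedent and negative pairs with the intervening mentions; the example mentions are invented.

```python
# Illustrative sketch: mention-pair training instances in the style of
# Soon et al. (2001). Mentions are (index, entity_id) tuples; entity_id
# identifies the gold coreference chain. The example mentions are invented.

def make_instances(mentions):
    """Return (antecedent_index, anaphor_index, label) training pairs."""
    instances = []
    for j, (_, entity_j) in enumerate(mentions):
        # Find the closest preceding mention of the same entity.
        closest = None
        for i in range(j - 1, -1, -1):
            if mentions[i][1] == entity_j:
                closest = i
                break
        if closest is None:
            continue  # first mention of its chain: no positive instance
        instances.append((closest, j, 1))        # positive pair
        for i in range(closest + 1, j):
            instances.append((i, j, 0))          # intervening negatives
    return instances

# "John ... the manager ... Mary ... he": chains A = {0, 1, 3}, B = {2}.
mentions = [(0, "A"), (1, "A"), (2, "B"), (3, "A")]
print(make_instances(mentions))
# -> [(0, 1, 1), (1, 3, 1), (2, 3, 0)]
```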
Target audience
This tutorial is designed for students and researchers in Computer
Science and Computational Linguistics. No prior knowledge of
coreference topics is assumed.
Biographical information of presenters
Simone Paolo Ponzetto is an assistant professor at the Computational
Linguistics Department of the University of Heidelberg, Germany. His
main research interests lie in the areas of information extraction,
knowledge acquisition and engineering, lexical semantics, and their
application to discourse-based phenomena.
Massimo Poesio is Chair in Humanities Computing at the University of
Trento and Director of the Language Interaction and Computation Lab at
the Center for Mind / Brain Sciences. He has participated in the
development of state-of-the-art systems for anaphora resolution such
as GUITAR, which have been applied to tasks such as summarization and
information extraction. He coordinated the 2007 Johns Hopkins Workshop
on Using Lexical and Encyclopedic Knowledge for Entity Disambiguation,
which led to the development of the BART system.
Tutorials Co-Chairs
- Diana McCarthy, University of Sussex, UK
- Chengqing Zong, Institute of Automation, Chinese Academy of Sciences (CASIA), China
Please send inquiries concerning ACL-IJCNLP 09 tutorials to tutorials-acl09 "at" sussex "dot" ac "dot" uk
Reference
A copy of Call-for-tutorials for reference.