Tutorials
ACL-IJCNLP 2009 Tutorials at a Glance, 2 August 2009 (Sunday), Suntec Level 2

Time        | MR202 | MR208 | MR209
08:30-10:00 | T1    | T2    | T3
10:00-10:30 | Coffee/Tea Break
10:30-12:00 | T1    | T2    | T3
12:00-14:00 | Lunch
14:00-15:30 | T4    | T5    | T6
15:30-16:00 | Coffee/Tea Break
16:00-17:30 | T4    | T5    | T6
T1: Fundamentals of Chinese Language Processing
Date/Time: Morning, 2 Aug 2009
Presenters/organisers: Chu-Ren Huang and Qin Lu
Abstract
This tutorial introduces the fundamentals of Chinese language processing for text processing. Computer processing of Chinese text requires an understanding of both the language itself and the technology for handling it. The tutorial contains two parts. The first part overviews the grammar of the Chinese language from a language processing perspective, based on naturally occurring data. Real examples of actual language use are presented in a data-driven, corpus-based approach, so that the links to computational approaches for processing Chinese follow naturally. A number of important Chinese NLP resources are also presented. The second part overviews Chinese-specific processing issues and the corresponding computational technologies. It focuses on Chinese word segmentation, with a brief introduction to part-of-speech tagging and some Chinese NLP applications. Word segmentation must deal with problems unique to Chinese, such as unknown word detection and named entity recognition, which are the emphasis of this tutorial.
This tutorial is targeted at both Chinese linguists who are interested in computational linguistics and computer scientists who are interested in research on processing Chinese. More specifically, the expected audience comes from three groups: (1) the linguistic community: linguists or language scientists whose typological, comparative, or theoretical research requires an understanding of Chinese grammar and the processing of Chinese text through observation of corpus data; it is also helpful for Chinese linguists, from graduate students to experts, who may know the language well but wish to learn methods for processing Chinese text data computationally; (2) researchers and students in computer science who are interested in research and development in language technology for Chinese; and (3) scholars in neighbouring fields who work on Chinese, such as communication, language learning and teaching technology, psychology, and sociology, who need a description of basic linguistic facts and resources for basic data.
Some basic knowledge of Chinese would be helpful. Comprehensive understanding of the language is not necessary.
Outline
- Part 1: Highlights of Chinese Grammar for NLP
- 1.1 Preliminaries: Orthography and Writing Conventions
- 1.2 Basic Units of Processing: Words or Characters?
- a. Word-forms vs. Character forms
- b. Word-senses vs. Character-senses
- 1.3 Part-of-Speech: important issues in defining word classes
- 1.4 Word Formation: From affixation to compounding
- 1.5 Unique Constructions and Challenges
- a. Classifier-noun agreement
- b. Separable Compounds (or Ionization)
- c. ‘Verbless’ Constructions
- 1.6. Chinese NLP resources
- Part 2: Text Processing
- 2.1 Lexical processing
- a. Segmentation
- b. Disambiguation
- c. Unknown word detection
- d. Named Entity Recognition
- 2.2 Syntactic processing
- a. Issues in PoS tagging
- b. Hidden Markov Models
- 2.3 NLP Applications
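To make the segmentation topic above concrete, the sketch below shows greedy forward maximum matching against a toy dictionary. It is an illustrative example only (the dictionary and example sentence are invented), not material from the tutorial itself; real systems combine such lookup with statistical disambiguation and unknown-word handling.

```python
# Illustrative sketch only: greedy forward maximum matching segmentation
# over a toy dictionary. Real systems add statistical disambiguation and
# unknown-word detection on top of dictionary lookup.

def fmm_segment(text, dictionary, max_len=4):
    """Segment `text` greedily, always taking the longest dictionary match."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

# Toy dictionary and sentence (invented for illustration).
toy_dict = {"研究", "研究生", "生命", "命", "起源"}
print(fmm_segment("研究生命起源", toy_dict))
# Greedy matching yields ['研究生', '命', '起源'], illustrating why
# dictionary matching alone still needs disambiguation.
```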
Biographical information of presenters
Professor Chu-Ren Huang
Dean of Humanities, The Hong Kong Polytechnic University
Research Fellow, Academia Sinica
Churen.huang@inet.polyu.edu.hk
Prof. Huang is the Dean of the Faculty of Humanities and Chair Professor of Applied Chinese Language Studies at the Hong Kong Polytechnic University and a research fellow at the Institute of Linguistics, Academia Sinica. He received his PhD in linguistics from Cornell University in 1987 and has since played a central role in developing Chinese language resources and in leading the fields of Chinese corpus and computational linguistics. In chronological order, he directed or co-directed the construction of the CKIP lexicon and ICG Grammar (80k entries, including 40k verbs with detailed PAS annotation), Sinica Corpus (10 million words, the first tagged balanced corpus of Chinese), Sinica Treebank, Academia Sinica Bilingual Ontological WordNet (a direct mapping from Princeton WordNet to Chinese with translation), Chinese WordSketch, Tagged Chinese Gigaword Corpus (tagging of the raw corpus from LDC consisting of 1,400 million characters from China, Taiwan, and Singapore), Hantology (an ontology based on the semantic classification of Chinese characters) and Chinese WordNet (over 20,000 senses currently). He has ensured that most of the above resources are available online and has spearheaded the efforts to create language learning sites, such as Adventures in Wen-Land and SouWenJieZi, to integrate these resources, as well as collaborated to develop the versatile tool of Chinese WordSketch. He also pioneered the use of corpora, especially automatically extracted collocations, in creating Chinese dictionaries, starting with the Mandarin Daily Classifier Dictionary in 1996. He has published over 70 journal and book articles and over 280 conference papers on different aspects of Chinese linguistics. He has also edited over 14 books or journal special issues, including the just-completed volume entitled Ontology and the Lexicon, to be published by Cambridge University Press in the Cambridge Studies in Natural Language Processing Series.
Prof. Qin Lu,
Department of Computing,
The Hong Kong Polytechnic University
Hung Hom, Hong Kong
Tel. (852) 2766 7247
Fax (852) 2774 0842
csluqin@comp.polyu.edu.hk
Prof. Qin Lu is a professor in the Department of Computing, The Hong Kong Polytechnic University. Prof. Lu received her BS degree in Electrical Engineering from Beijing Normal University. She then studied at the University of Illinois at Urbana-Champaign, where she received both her M.S. and Ph.D. in computer science. She started working in the fields of open systems and natural language processing in 1992, and has since worked in Chinese-processing-related areas covering open systems, system software development, information retrieval, and standardization. Prof. Lu initiated the I-Hanzix open system architecture, which uses the I18N/L10N concepts to support multiple locales and codeset announcement at the system level. She received an Industrial Support Fund grant of 3.67 million to develop a Chinese Information Server and server access software that make information access over the Internet codeset-transparent, regardless of the Chinese encoding of the documents being accessed. Her work in NLP is mostly focused on using computational methods for information extraction and text mining. She has conducted research in Chinese segmentation and PoS tagging using statistical methods. She was the first to conduct extensive work on Chinese collocation extraction, exploring methods whose results have produced useful resources such as a Chinese collocation bank and a shallow Chinese treebank. In recent years, Prof. Lu has worked extensively in Chinese terminology extraction and ontology construction.
T2: Topics in Statistical Machine Translation
Date/Time: Morning, 2 Aug 2009
Presenters/organisers: Kevin Knight, Philipp Koehn
Abstract
In the past, we presented tutorials called "Introduction to Statistical Machine Translation", aimed at people who know little or nothing about the field and want to get acquainted with the basic concepts. This tutorial, by contrast, goes more deeply into selected topics of intense current interest. We envision two types of participants:
1) People who understand the basic idea of statistical machine translation and want to get a survey of hot-topic current research, in terms that they can understand.
2) People associated with statistical machine translation work, who have not had time to study the most current topics in depth.
We fill the gap between the introductory tutorials that have gone before and the detailed scientific papers presented at ACL sessions.
Outline
Below is our tutorial structure. We showcase the intuitions behind the algorithms and give examples of how they work on sample data. Our selection of topics focuses on techniques that deliver proven gains in translation accuracy, and we supply empirical results from the literature.
- QUICK REVIEW (15 minutes)
- - Phrase-based and syntax-based MT.
- ALGORITHMS (45 minutes)
- - Efficient decoding for phrase-based and syntax-based MT (cube pruning, forward/outside costs).
- - Minimum Bayes-risk decoding.
- - System combination.
- SCALING TO LARGE DATA (30 minutes)
- - Phrase table pruning, storage, suffix arrays.
- - Large language models (distributed LMs, noisy LMs).
- NEW MODELS (1 hour and 10 minutes)
- - New methods for word alignment (beyond GIZA++).
- - Factored models.
- - Maximum entropy models for rule selection and re-ordering.
- - Acquisition of syntactic translation rules.
- - Syntax-based language models and target-language dependencies.
- - Lattices for encoding source-language uncertainties.
- LEARNING TECHNIQUES (20 minutes)
- - Discriminative training (perceptron, MIRA).
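As a minimal illustration of the minimum Bayes-risk item above (an invented sketch, not the presenters' material), the following selects from an n-best list the hypothesis with the highest expected gain against the rest of the list, using a crude unigram-overlap score in place of BLEU; the hypotheses and posterior probabilities are made up.

```python
# Illustrative sketch: minimum Bayes-risk selection over an n-best list.
# The "gain" between two hypotheses is a crude unigram-overlap score,
# standing in for BLEU; hypotheses and posteriors are invented.
from collections import Counter

def gain(hyp, ref):
    """Unigram overlap between two tokenised hypotheses (BLEU stand-in)."""
    overlap = Counter(hyp) & Counter(ref)
    return sum(overlap.values()) / max(len(hyp), 1)

def mbr_select(nbest):
    """nbest: list of (tokens, posterior). Return the hypothesis that
    maximises expected gain against the whole list."""
    best, best_score = None, float("-inf")
    for hyp, _ in nbest:
        expected = sum(p * gain(hyp, other) for other, p in nbest)
        if expected > best_score:
            best, best_score = hyp, expected
    return best

nbest = [("the cat sat on the mat".split(), 0.5),
         ("a cat sat on the mat".split(), 0.3),
         ("the cat is on a mat".split(), 0.2)]
print(" ".join(mbr_select(nbest)))
```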
Biographical information of presenters
Kevin Knight
Address: 4676 Admiralty Way, Marina del Rey, CA, 90292, USA
Email: knight@isi.edu
Dr. Knight’s research interests include machine translation, automata theory, and decipherment. He has authored numerous scientific papers on statistical machine translation, and he is active in building and deploying large-scale MT systems. Previously, he served on the editorial boards of the Computational Linguistics journal, the Journal of Artificial Intelligence Research, and the ACM Transactions on Speech and Language Processing. Dr. Knight is Chief Scientist at Language Weaver, Inc., and he was General Chair of the 2005 ACL conference in Ann Arbor, Michigan.
Philipp Koehn
Address: 10 Crichton Road, Edinburgh, EH8-9AB
Email: pkoehn@inf.ed.ac.uk
Dr. Koehn's research interests include machine translation and its applications, as well as large-scale natural language learning. He has been a lecturer at the University of Edinburgh since 2005, after spending a year as a post-doc at MIT and receiving his PhD from the University of Southern California in 2003. He has served as area chair for machine translation at major conferences (ACL, NAACL, MT Summit) and is known for his efforts to foster open source resources for machine translation, such as the Moses decoder and the Europarl corpus.
T3: Semantic Role Labeling: Past, Present and Future
Date/Time: Morning, 2 Aug 2009
Presenter/organiser: Lluís Màrquez
Abstract
Semantic Role Labeling (SRL) consists of detecting basic event
structures such as "who" did "what" to "whom", "when" and "where". The
identification of such event frames holds potential for significant
impact in many NLP applications, such as Information Extraction,
Question Answering, Summarization and Machine Translation among
others. The work on SRL has included a broad spectrum of supervised
probabilistic and machine learning approaches, presenting significant
advances in many directions over the last several years. However,
despite all the efforts and the considerable degree of maturity of the
SRL technology, the use of SRL systems in real-world applications has
so far been limited and, certainly, below the initial expectations.
This is due to the weaknesses and limitations of current
systems, which have been highlighted by many of the evaluation
exercises and have remained unresolved for several years.
This tutorial has two distinct parts. In the first, the
state-of-the-art on SRL will be overviewed, including: main techniques
applied, existing systems, and lessons learned from the evaluation
exercises. This part will include a critical review of current
problems and the identification of the main challenges for the
future. The second part is devoted to lines of research aimed at
overcoming current limitations. This part will include an analysis of
the relation between syntax and SRL, the development of joint systems
for integrated syntactic-semantic analysis, generalization across
corpora, and engineering of truly semantic features.
Outline
- 1. Introduction
- * Problem definition and properties
- * Importance of SRL
- * Main computational resources and systems available
- 2. State-of-the-art SRL systems
- * Architecture
- * Training of different components
- * Feature engineering
- 3. Empirical evaluation of SRL systems
- * Evaluation exercises at SemEval and CoNLL conferences
- * Main lessons learned
- 4. Current problems and challenges
- 5. Keys for future progress
- * Relation to syntax: joint learning of syntactic and semantic dependencies
- * Generalization across domains and text genres
- * Use of semantic knowledge
- * SRL systems in applications
- 6. Conclusions
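Relating to the evaluation item above, here is a small, purely illustrative sketch (not part of the tutorial materials) of the labeled precision/recall/F1 computation typically reported for SRL systems; the gold and predicted argument annotations are invented.

```python
# Illustrative sketch: labeled precision/recall/F1 over argument spans,
# the usual headline numbers in SRL evaluations. Annotations are invented.

def prf(gold, predicted):
    """gold, predicted: sets of (predicate, start, end, role) tuples."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# "The company bought the factory last year."
gold = {("bought", 0, 1, "A0"), ("bought", 3, 4, "A1"),
        ("bought", 5, 6, "AM-TMP")}
predicted = {("bought", 0, 1, "A0"), ("bought", 3, 4, "A1"),
             ("bought", 5, 6, "A2")}  # wrong label on the temporal argument
print("P=%.2f R=%.2f F1=%.2f" % prf(gold, predicted))
# -> P=0.67 R=0.67 F1=0.67
```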
Biographical information of presenter
Lluís Màrquez
TALP Research Center
Software Department
Technical University of Catalonia
e-mail: lluism@lsi.upc.edu
URL: http://www.lsi.upc.edu/~lluism
He received his Ph.D. in Computer Science from the Technical University of Catalonia
(UPC, 1999). Currently, he is an Associate Professor in the Software
Department (LSI, UPC) lecturing at the Computer Science Faculty of
Barcelona. He is also a senior researcher at the center for research
in Speech and Language Technologies and Applications (TALP). His
current research interests are focused on Machine Learning
architectures for Natural Language structured problems, including
parsing, semantic role labeling, named entity extraction, and word
sense disambiguation. Regarding applications, he is working on the
introduction of high-level linguistic information to Statistical
Machine Translation and Oral Question Answering. He has published over
75 refereed papers on the previous topics in journals and conferences
of NLP and Machine Learning areas. He was program chair of CoNLL-2006
and organizer of the SemEval-2007 semantic evaluation competition and
workshop. He also organized the shared tasks on syntactic and semantic
parsing at CoNLL-2004, 2005, 2008 and 2009, and led the teams that
prepared three evaluation tasks at Senseval-3 and SemEval-2007. He has
been guest editor of the special issues "Semantic Role Labeling" and "Computational Semantic Analysis of Language" at Computational
Linguistics and Language Resources and Evaluation, respectively.
Currently, he acts as president of the ACL SIG on Natural Language
Learning (SIGNLL) and chairs the 13th Annual Conference of the
European Association for Machine Translation (EAMT-2009), and the
SEW-2009 NAACL-HLT workshop, "Semantic Evaluations: Recent
Achievements and Future Directions".
T4: Computational Modeling of Human Language Acquisition
Date/Time: Afternoon, 2 Aug 2009
Presenter/organiser: Afra Alishahi
Abstract
The nature and amount of information needed for learning a natural language, and the underlying mechanisms involved in this process, are the subject of much debate: is it possible to learn a language from usage data only, or is some sort of innate knowledge and/or bias needed to boost the process? This is a topic of interest to (psycho)linguists who study human language acquisition, as well as computational linguists who develop the knowledge sources necessary for large-scale natural language processing systems. Children are a source of inspiration for any such study of language learnability. They learn language with ease, and their acquired knowledge of language is flexible and robust.
Human language acquisition has been studied for centuries, but using computational modeling for such studies is a relatively recent trend. However, computational approaches to language learning have become increasingly popular, mainly due to advances in machine learning techniques and the availability of vast collections of experimental data on child language learning and child-adult interaction. Many of the existing computational models attempt to study the complex task of learning a language under cognitive plausibility criteria (such as the memory and processing limitations that humans face), as well as to explain the developmental patterns observed in children. Such computational studies can provide insight into the plausible mechanisms involved in human language acquisition, and be a source of inspiration for developing better language models and techniques.
This tutorial will review the main research questions that researchers in the field of computational language acquisition are concerned with, as well as the common approaches and techniques used in developing these models. Computational modeling has been applied to many domains of language acquisition, including word segmentation and phonology, morphology, syntax, semantics and discourse. However, due to time restrictions, the focus of the tutorial will be on the acquisition of word meaning, syntax, and the link between syntax and semantics.
Outline
- * Computational Psycholinguistics and NLP
- - Overview: computational modeling and experimental observations
- - Evaluation of computational models
- * Human Language Acquisition
- - Modularity
- - Learnability and innateness
- - Available collections of child experimental data
- * Computational Studies of Learning Word Meaning
- - Word learning as constraint satisfaction
- - Probabilistic simulation of developmental patterns
- * Computational Models of Syntax Acquisition
- - Symbolic accounts of grammar acquisition
- - Connectionist models of learning linguistic structures
- - Probabilistic grammar induction from text data
- * Relation between Syntax and Semantics
- - Verb argument structure and linking rules
- - Connectionist models of linking syntax and semantics
- - Computational construction-based approaches
- - Bayesian models of argument structure acquisition
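As a toy illustration of the word-learning topic above (an invented sketch, not the presenter's model), the following accumulates cross-situational co-occurrence counts between words and candidate meanings and normalises them into word-meaning association scores.

```python
# Illustrative sketch: cross-situational word learning by co-occurrence
# counting. Each "situation" pairs an utterance with a set of candidate
# meanings; repeated exposure gradually disambiguates word-meaning
# mappings. The utterances and meaning symbols are invented.
from collections import defaultdict

def learn(situations):
    counts = defaultdict(lambda: defaultdict(float))
    for words, meanings in situations:
        for w in words:
            for m in meanings:
                counts[w][m] += 1.0
    # Normalise counts into association scores per word.
    assoc = {}
    for w, ms in counts.items():
        total = sum(ms.values())
        assoc[w] = {m: c / total for m, c in ms.items()}
    return assoc

situations = [
    (["the", "dog", "barks"], {"DOG", "BARK"}),
    (["the", "dog", "runs"], {"DOG", "RUN"}),
    (["the", "cat", "runs"], {"CAT", "RUN"}),
]
assoc = learn(situations)
print(max(assoc["dog"], key=assoc["dog"].get))  # -> DOG
```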
Biographical information of presenter
Afra Alishahi (afra@coli.uni-sb.de)
Computational Psycholinguistics Group,
Department of Computational
Linguistics and Phonetics,
Saarland University, Germany
Afra Alishahi received her PhD from the Computer
Science Department, University of Toronto, Canada where she was a
member of the Computational Linguistics group. She is now a
postdoctoral fellow at the Computational Psycholinguistics group in
the Department of Computational Linguistics and Phonetics, Saarland
University. She has been working on the probabilistic modeling of
various aspects of child language acquisition, including verb argument
structure, word meaning, verb semantic roles and selectional
preferences.
T5: Learning to Rank
Date/Time: Afternoon, 2 Aug 2009
Presenter/organiser: Hang Li
Short Abstract
In this tutorial I will introduce "learning to rank", a machine learning technology for constructing a model that ranks objects using training data. I will first explain the problem formulation of learning to rank and its relation to other learning tasks. I will then describe learning to rank methods developed in recent years, including pointwise, pairwise, and listwise approaches. I will then give an introduction to the theoretical work on learning to rank and to its applications. Finally, I will outline some future directions of research on learning to rank. The goal of this tutorial is to give the audience a comprehensive survey of the technology and to stimulate more research on it and on its application to natural language processing.
Learning to rank has been successfully applied to information retrieval and is potentially useful for natural language processing as well. In fact, many NLP tasks can be formalized as ranking problems, and NLP technologies may be significantly improved by using learning to rank techniques. These include question answering, summarization, and machine translation. For example, in machine translation, given a sentence in the source language, we are to translate it into a sentence in the target language. Usually there are multiple possible translations, and it would be better to sort them in descending order of their likelihood and output the sorted results. Learning to rank can be employed in this task.
Outline
- 1. Introduction
- 2. Learning to Rank Problem
- a) Problem Formulation
- b) Evaluation
- 3. Learning to Rank Methods
- a) Pointwise Approach
- b) Pairwise Approach
- i. Ranking SVM
- ii. RankBoost
- iii. RankNet
- iv. IR SVM
- c) Listwise Approach:
- i. ListNet
- ii. ListMLE
- iii. AdaRank
- iv. SVM Map
- v. PermuRank
- vi. SoftRank
- d) Other Methods
- 4. Learning to Rank Theory
- a) Pairwise Approach
- i. Generalization Analysis
- b) Listwise Approach
- i. Generalization Analysis
- ii. Consistency Analysis
- 5. Learning to Rank Applications
- a) Search Ranking
- b) Collaborative Filtering
- c) Key Phrase Extraction
- d) Potential Applications in Natural Language Processing
- 6. Future Directions for Learning to Rank Research
- 7. Conclusion
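As a minimal illustration of the pairwise approach listed above (an invented sketch, not the presenter's material), the following trains a linear scoring function with a pairwise perceptron: whenever a more relevant item scores no higher than a less relevant one from the same query, the weights are nudged toward the more relevant item's features.

```python
# Illustrative sketch: a pairwise perceptron for learning to rank.
# Each query has items as (feature_vector, relevance_label); data invented.

def train_pairwise(queries, dim, epochs=20, lr=0.1):
    """Learn linear weights so that higher-relevance items score higher."""
    w = [0.0] * dim
    for _ in range(epochs):
        for items in queries:
            for xi, yi in items:
                for xj, yj in items:
                    if yi > yj:  # xi should outrank xj
                        si = sum(a * b for a, b in zip(w, xi))
                        sj = sum(a * b for a, b in zip(w, xj))
                        if si <= sj:  # mis-ordered pair: update weights
                            w = [wk + lr * (a - b)
                                 for wk, a, b in zip(w, xi, xj)]
    return w

# Two toy queries, 2-dimensional features; higher label = more relevant.
queries = [
    [([1.0, 0.2], 2), ([0.3, 0.9], 1), ([0.1, 0.1], 0)],
    [([0.9, 0.1], 1), ([0.2, 0.8], 0)],
]
w = train_pairwise(queries, dim=2)
score = lambda x: sum(a * b for a, b in zip(w, x))
print([label for _, label in sorted(queries[0], key=lambda it: -score(it[0]))])
# -> [2, 1, 0]: the learned weights reproduce the gold ordering
```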
Biographical information of presenter
Hang Li
Microsoft Research Asia
Email: hangli@microsoft.com
Homepage: http://research.microsoft.com/en-us/people/hangli/
Hang Li is a senior researcher and research manager in the Information Retrieval and Mining Group at Microsoft Research Asia. He is also an adjunct professor at Peking University, Nanjing University, Xi'an Jiaotong University, and Nankai University. His research areas include natural language processing, information retrieval, statistical machine learning, and data mining. He graduated from Kyoto University and earned his PhD from the University of Tokyo. Hang Li has been working on learning to rank and its applications. He has 15 publications on the topic at SIGIR, ICML, and other top conferences. He co-organized two workshops on learning to rank at SIGIR’07 and SIGIR’08. The group he leads is regarded as one of the most active research groups in this area.
T6: State-of-the-art NLP approaches to coreference resolution: theory and practical recipes.
Date/Time: Afternoon, 2 Aug 2009
Presenter/organisers: Simone Paolo Ponzetto, Massimo Poesio
Abstract
Identifying the nominal phrases in a discourse that are used to refer
to the same (discourse) entity is essential for achieving robust
natural language understanding (NLU). The importance of this task is
amplified as the field of Natural Language Processing (NLP) moves
towards high-level linguistic tasks requiring NLU capabilities, such
as recognizing textual entailment. This
tutorial aims at providing the NLP community with a gentle
introduction to the task of coreference resolution from both a
theoretical and an application-oriented perspective. Its main purposes
are: (1) to introduce a general audience of NLP researchers to the
core ideas underlying state-of-the-art computational models of
coreference; (2) to provide that same audience with an overview of NLP
applications which can benefit from coreference information.
Outline
The tutorial is divided into three main parts:
1. Introduction to machine learning approaches to coreference
resolution.
We start by focusing on the machine-learning-based approaches
developed in the seminal works of Soon et al. (2001) and Ng & Cardie
(2002). We then analyze the main limitations of these approaches,
namely that they cluster mentions based on a local pairwise
classification of nominal phrases in text (a minimal sketch of this
pairwise set-up is given after the outline). We finally move on to
present more complex models which attempt to model coreference as a
global discourse phenomenon (Yang et al., 2003; Luo et al., 2004;
Daume & Marcu, 2005; inter alia).
2. Lexical and encyclopedic knowledge for coreference resolution.
Resolving anaphors to their correct antecedents requires
in many cases lexical and encyclopedic knowledge. We accordingly
introduce approaches which attempt to include semantic information
into the coreference models from a variety of knowledge sources,
e.g. WordNet (Harabagiu et al., 2001), Wikipedia (Ponzetto & Strube,
2006) and automatically harvested patterns (Poesio et al., 2002;
Markert & Nissim, 2005; Yang & Su, 2007).
3. Applications and future directions.
We present an overview of NLP applications which have been shown to
profit from coreference information, e.g. question answering and
summarization. We conclude with remarks on future work
directions. These include: a) bringing together approaches to
coreference using semantic information with global discourse
modeling techniques; b) exploring novel application scenarios which
could potentially benefit from coreference resolution, e.g. relation
extraction and extracting events and event chains from text.
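As referenced in part 1 of the outline, here is a small, purely illustrative sketch of Soon et al. (2001)-style training instance creation for a mention-pair model: each anaphoric mention forms a positive pair with its closest antecedent and negative pairs with the intervening mentions; the example mentions are invented.

```python
# Illustrative sketch: mention-pair training instances in the style of
# Soon et al. (2001). Mentions are (index, entity_id) tuples; entity_id
# identifies the gold coreference chain. The example mentions are invented.

def make_instances(mentions):
    """Return (antecedent_index, anaphor_index, label) training pairs."""
    instances = []
    for j, (_, entity_j) in enumerate(mentions):
        # Find the closest preceding mention of the same entity.
        closest = None
        for i in range(j - 1, -1, -1):
            if mentions[i][1] == entity_j:
                closest = i
                break
        if closest is None:
            continue  # first mention of its chain: no positive instance
        instances.append((closest, j, 1))        # positive pair
        for i in range(closest + 1, j):
            instances.append((i, j, 0))          # intervening negatives
    return instances

# "John ... the manager ... Mary ... he": chains A = {0, 1, 3}, B = {2}.
mentions = [(0, "A"), (1, "A"), (2, "B"), (3, "A")]
print(make_instances(mentions))
# -> [(0, 1, 1), (1, 3, 1), (2, 3, 0)]
```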
Target audience
This tutorial is designed for students and researchers in Computer
Science and Computational Linguistics. No prior knowledge of
coreference topics is assumed.
Biographical information of presenters
Simone Paolo Ponzetto is an assistant professor at the Computational
Linguistics Department of the University of Heidelberg, Germany. His
main research interests lie in the areas of information extraction,
knowledge acquisition and engineering, lexical semantics, and their
application to discourse-based phenomena.
Massimo Poesio is Chair in Humanities Computing at the University of
Trento and Director of the Language Interaction and Computation Lab at
the Center for Mind / Brain Sciences. He has participated in the
development of state-of-the-art systems for anaphora resolution such
as GUITAR, which have been applied to tasks such as summarization and
information extraction. He coordinated the 2007 Johns Hopkins Workshop
on Using Lexical and Encyclopedic Knowledge for Entity Disambiguation,
which led to the development of the BART system.
Tutorials Co-Chairs
- Diana McCarthy, University of Sussex, UK
- Chengqing Zong, Institute of Automation, Chinese Academy of Sciences (CASIA), China
Please send inquiries concerning ACL-IJCNLP 09 tutorials to tutorials-acl09 "at" sussex "dot" ac "dot" uk
Reference
A copy of Call-for-tutorials for reference.