The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Held at the Portland Marriott Downtown Waterfront in
Portland, Oregon, USA, June 19-24, 2011


Formal and Empirical Grammatical Inference

PRESENTERS: Jeffrey Heinz, Colin de la Higuera, and Menno van Zaanen

ABSTRACT: Computational linguistics (CL) often faces learning problems: which algorithm takes as input some finite dataset, such as (annotated) corpora, and outputs a system that behaves "correctly" on some task? Data chunking, dependency parsing, named entity recognition, part-of-speech tagging, semantic role labelling, and the more general problem of deciding which of all possible (structured) strings are grammatical are all learning problems in this sense. The subfield of theoretical computer science known as grammatical inference (GI) studies how grammars can be obtained from data. Although there has been limited interaction between the fields of GI and CL to date, research by the GI community directly impacts the study of human language, and of CL in particular. The purpose of this tutorial is to introduce GI to computational linguists and to show how the theories, algorithms, and models studied by the GI community can benefit CL. Particular attention is paid both to foundational issues (e.g. how learning ought to be defined), which are relevant to all learning problems, even those outside CL, and to issues specific to problems within CL. One theme that runs throughout the tutorial is how properties of natural languages can be used to reduce the instance space of the learning problem, which can lead to provably correct learning solutions. For example, some natural language patterns are mildly context-sensitive (MCS), but many linguists agree that not all MCS languages are possible natural languages. So instead of trying to learn all MCS languages (or context-free, regular, or stochastic variants thereof), one approach is to define a smaller problem space that better approximates natural languages.
Interestingly, many of the regions that the GI community has carved out for natural language appear to have the right properties for feasible learning, even under the most stringent definitions of learning.

OUTLINE:
1. Formal GI and learning theory. Those learnability properties that should be regarded as necessary (though not sufficient) conditions for "good" language learning are discussed. Additionally, the most important proofs that use similar algorithmic ideas are described.
2. Empirical approaches to regular and sub-regular natural language classes. This part focuses on sub-regular classes that are learnable under many (including the hardest) settings. General learning strategies are introduced, along with probabilistic variants, and are illustrated with natural language examples, primarily from the domain of phonology.
3. Empirical approaches to non-regular natural language classes. This part treats the learnability of context-free and context-sensitive formalisms, where the aim is to approximate the grammar of a specific language from data, not to show the learnability of classes of languages. An overview of well-known empirical grammatical inference systems is given in the context of natural language syntax and semantics.

PRESENTER BIOS:

Jeffrey Heinz
Department of Linguistics and Cognitive Science, University of Delaware
Email: heinz@udel.edu

Jeffrey Heinz's research is located at the intersection of theoretical linguistics, theoretical computer science, computational linguistics, grammatical inference, and cognitive science. He has authored or co-authored articles in these areas for the journals Phonology, Linguistic Inquiry, and the Journal of Child Language; chapters in the forthcoming Blackwell Companion to Phonology and Cambridge Handbook of Developmental Linguistics; and papers in the proceedings of the ACL, CoNLL, ICGI, MOL, and SIGMORPHON. His current research centers on the hypotheses that phonology is subregular and that phonological learning is modular.
He holds a BS in Mathematics and a BA in Linguistics from the University of Maryland, College Park, and a PhD in Linguistics from the University of California, Los Angeles. He is currently an assistant professor at the University of Delaware.

Colin de la Higuera
Laboratoire LINA UMR CNRS 6241
Email: Colin.Delahiguera@univ-nantes.fr

Colin de la Higuera received his PhD from Bordeaux University, France. He has been an associate professor at the University of Montpellier and a professor at Saint-Etienne University, and is now a professor at Nantes University. His chief interest lies in grammatical inference; in this area he has authored more than 50 peer-reviewed research papers and a monograph, 'Grammatical Inference: Learning Automata and Grammars', published in 2010. He served as chairman of the International Community in Grammatical Inference (2002-2007) and is now in charge of the curriculum development programme of the European network PASCAL 2. He has given tutorials on grammar induction at ICML 1999, CAP 2001, ECML/PKDD 2003, IJCAI 2005, and ICML 2006, as well as at various summer schools in machine learning and computational linguistics.

Menno van Zaanen
Tilburg centre for Cognition and Communication, Department of Communication and Information Sciences, Tilburg University
Email: mvzaanen@uvt.nl

Menno van Zaanen is an assistant professor in the Department of Communication and Information Sciences at Tilburg University. He has co-organized the machine translation and grammatical inference competitions Tenjinno and Omphalos. He was chairman of the ICGI community from 2007 until 2010, and he co-organized the ICGI 2002 conference held in Amsterdam, in addition to several workshops (the HCSNet data workshop 2005, the Workshop on Grammatical Inference Applications at IJCAI 2005, and the ECML/PKDD Workshop and Tutorial on Learning Context-Free Grammars 2003). His main research aim is to apply symbolic machine learning to computational linguistics.
Based on his PhD research on learning context-free grammars in a computational linguistics setting, he has applied similar techniques to question answering, sentence classification, learning structure in music, and machine translation. His work on Alignment-Based Learning (ABL) has been adopted by several researchers, not only for empirical learning of syntax but also for machine translation and the learning of morphology, and it has led to further research in formal grammatical inference.
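The flavor of the provably correct, sub-regular learning strategies mentioned in the outline can be conveyed with a minimal sketch (not part of the tutorial materials; the boundary symbol and function names are illustrative assumptions): a learner for a strictly 2-local (SL2) phonotactic pattern simply records the bigrams attested in positive data and accepts exactly the strings built from them.

```python
# Minimal sketch (illustrative only): learning a strictly 2-local (SL2)
# language from positive data. The learner collects the bigrams attested
# in boundary-augmented words; the induced grammar accepts exactly those
# strings whose bigrams all appear in the collected set.

BOUNDARY = "#"  # word-boundary symbol (an assumption of this sketch)

def learn_sl2(sample):
    """Collect the set of bigrams attested in a positive sample."""
    grammar = set()
    for word in sample:
        padded = BOUNDARY + word + BOUNDARY
        for i in range(len(padded) - 1):
            grammar.add(padded[i:i + 2])
    return grammar

def accepts(grammar, word):
    """A word is grammatical iff every one of its bigrams is attested."""
    padded = BOUNDARY + word + BOUNDARY
    return all(padded[i:i + 2] in grammar
               for i in range(len(padded) - 1))

# Toy phonotactics: strictly alternating CV words.
g = learn_sl2(["ba", "baba", "bababa"])
print(accepts(g, "babababa"))  # True: built entirely from attested bigrams
print(accepts(g, "bb"))        # False: the bigram "bb" was never observed
```

Because the target class is restricted in advance (here to SL2), this kind of learner identifies the target language in the limit from positive data alone, which is the sense in which shrinking the problem space yields provably correct learning.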



acl2011.conference@gmail.com   ♦   Oregon Health & Science University