Tutorial: Structured sparsity in natural language processing: Models, algorithms and applications

Sunday 27 April, afternoon

Presenters: André F. T. Martins, Mário A. T. Figueiredo, Noah A. Smith, and Dani Yogatama

Tutorial contents

This tutorial will cover recent advances in sparse modeling with diverse applications in natural language processing (NLP). A sparse model is one that uses a relatively small number of features to map an input to an output, such as a label sequence or parse tree. The advantages of sparsity include compactness and interpretability; indeed, sparsity is currently a major theme in statistics, machine learning, and signal processing. The pursuit of sparsity can be seen as a continuation of the earlier goals of feature selection and, hence, model selection (Della Pietra et al., 1997; Guyon and Elisseeff, 2003; McCallum, 2003).

This tutorial will focus on methods which embed sparse model selection into the parameter estimation problem. In such methods, learning is carried out by minimizing a regularized empirical risk functional composed of two terms: a "loss term," which controls the goodness of fit to the data (e.g., log loss or hinge loss), and a "regularizer term," which is designed to promote sparsity.
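As a concrete sketch of this formulation (the notation below is ours, not taken from the tutorial materials), given training pairs $(x_n, y_n)$, $n = 1, \dots, N$, the learning problem is

\[
\min_{\mathbf{w}} \;\; \underbrace{\frac{1}{N} \sum_{n=1}^{N} L(\mathbf{w}; x_n, y_n)}_{\text{loss term}} \;+\; \underbrace{\lambda\, \Omega(\mathbf{w})}_{\text{regularizer term}},
\]

where $L$ is, e.g., the log loss or hinge loss, $\Omega$ is a sparsity-inducing regularizer, and $\lambda \ge 0$ controls the trade-off between data fit and sparsity.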

The simplest example is L1-norm regularization (Tibshirani, 1996), which penalizes weight components individually and has been explored in various NLP applications (Kazama and Tsujii, 2003; Goodman, 2004). More sophisticated regularizers, based on mixed norms and groups of weights, are able to promote "structured" sparsity: that is, sparsity patterns compatible with a priori knowledge about the structure of the feature space. Regularizers of this kind have been proposed in the statistics and signal processing literature (Yuan and Lin, 2006; Zhao et al., 2009; Bach et al., 2011) and are a recent topic of research in NLP (Eisenstein et al., 2011; Martins et al., 2011; Das and Smith, 2012; Nelakanti et al., 2013). Some regularizers can even encourage structured sparsity without prior knowledge of that structure (Bondell et al., 2007; Zeng et al., 2013). Sparsity-inducing regularizers require specialized optimization routines for learning (Bach et al., 2011; Wright et al., 2009; Xiao, 2009; Langford et al., 2009).
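To make the optimization part more concrete, the following is a minimal NumPy sketch of the proximal (shrinkage) operators that underlie L1 and group-lasso regularization, as used inside proximal-gradient methods; the function names and the simple batch setting are our assumptions, not code from the tutorial.

import numpy as np

def prox_l1(w, tau):
    # Soft-thresholding: proximal operator of tau * ||w||_1.
    # Shrinks every weight toward zero and sets small weights exactly to zero.
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def prox_group_l2(w, groups, tau):
    # Group soft-thresholding: proximal operator of tau * sum_g ||w_g||_2.
    # Sets entire groups of weights to zero at once (structured sparsity).
    w = w.copy()
    for g in groups:  # each g is a list of indices forming one group
        norm = np.linalg.norm(w[g])
        w[g] = 0.0 if norm <= tau else (1.0 - tau / norm) * w[g]
    return w

def proximal_gradient_step(w, grad_loss, eta, tau, groups=None):
    # One step of proximal gradient descent: a gradient step on the loss,
    # followed by the proximal operator of the (scaled) regularizer.
    z = w - eta * grad_loss(w)
    return prox_l1(z, eta * tau) if groups is None else prox_group_l2(z, groups, eta * tau)

The point of the group operator is that it zeroes out whole groups of weights at once, which is precisely the kind of structured sparsity pattern described above.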

The tutorial will consist of three parts: (1) how to formulate the problem, i.e., how to choose the right regularizer for the kind of sparsity pattern intended; (2) how to solve the optimization problem efficiently; and (3) examples of the use of sparsity within natural language processing problems.

Biographies

André F. T. Martins is a research scientist at Priberam Labs. He received his dual-degree PhD in Language Technologies in 2012 from Carnegie Mellon University and Instituto Superior Técnico. His PhD dissertation was awarded Honorable Mention in CMU’s SCS Dissertation Award competition. Martins’ research interests include natural language processing, machine learning, structured prediction, sparse modeling, and optimization. His paper “Concise Integer Linear Programming Formulations for Dependency Parsing” received a best paper award at ACL 2009.

Mário A. T. Figueiredo is a professor of electrical and computer engineering at Instituto Superior Técnico (the engineering school of the Technical University of Lisbon). His main research interests include machine learning, statistical signal processing, and optimization. He recently guest co-edited a special issue of the IEEE Journal of Selected Topics in Signal Processing devoted to compressive sensing (one of the central areas of research on sparsity) and has given several invited talks (and a tutorial at ICASSP 2012) on optimization algorithms for problems involving sparsity.

Noah A. Smith is the Finmeccanica Associate Professor of Language Technologies and Machine Learning at CMU. His research interests include statistical natural language processing, especially unsupervised methods, machine learning for structured data, and applications of natural language processing, including machine translation and statistical modeling of text and quantitative social data. He recently published a book, Linguistic Structure Prediction, on many of these topics; in 2009 he gave a tutorial at ICML on structured prediction in NLP.

Dani Yogatama is a PhD student in the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University. His research interests include statistical methods for natural language processing, with applications to text-driven forecasting. Prior to CMU, he was a Monbukagakusho fellow at the University of Tokyo.