*** ACL2012 ***

TITLE

Graph-based Semi-Supervised Learning Algorithms for NLP

PRESENTERS

Amarnag Subramanya and Partha Pratim Talukdar

ABSTRACT

While labeled data is expensive to prepare, ever-increasing amounts of unlabeled linguistic data are becoming widely available. To take advantage of this, several semi-supervised learning (SSL) algorithms, which learn from labeled as well as unlabeled data, have been developed. In a separate line of work, researchers have started to realize that graphs provide a natural way to represent data in a variety of domains. Graph-based SSL algorithms, which bring together these two lines of work, have been shown to outperform the state of the art in many applications in speech processing, computer vision, and NLP. In particular, recent NLP research has successfully used graph-based SSL algorithms for PoS tagging, semantic parsing, knowledge acquisition, sentiment analysis, and text categorization. Recognizing this promising and emerging area of research, this tutorial focuses on graph-based SSL algorithms (e.g., label propagation methods). The tutorial is intended to be a sequel to the ACL 2008 SSL tutorial, focusing exclusively on graph-based SSL methods and recent advances in this area, which were beyond the scope of the previous tutorial.
The tutorial is divided into two parts. In the first part, we will motivate the need for graph-based SSL methods, introduce some standard graph-based SSL algorithms, and discuss connections between these approaches. We will also discuss how linguistic data can be encoded as graphs and show how graph-based algorithms can be scaled to large amounts of data (e.g., web-scale data). The second part will focus on how graph-based methods can be used to solve several critical NLP tasks, including basic problems such as PoS tagging, semantic parsing, and coreference resolution, as well as more downstream tasks such as text categorization, information acquisition, and sentiment analysis. We will conclude the tutorial with some exciting avenues for future work.
Familiarity with semi-supervised learning and graph-based methods will not be assumed, and the necessary background will be provided. Examples from NLP tasks will be used throughout the tutorial to convey the necessary concepts. At the end of this tutorial, the attendee will walk away with the following:
    • An in-depth knowledge of the current state of the art in graph-based SSL algorithms, and the ability to implement them.
    • The ability to decide on the suitability of graph-based SSL methods for a problem.
    • Familiarity with different NLP tasks where graph-based SSL methods have been successfully applied.
In addition to the above goals, we hope that this tutorial will better prepare the attendee to conduct exciting research at the intersection of NLP and other emerging areas with natural graph-structured data (e.g., Computational Social Science). Please visit http://graph-ssl.wikidot.com/ for details.
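
To give a flavor of the label propagation methods mentioned above, the following is a minimal sketch, not the presenters' implementation. It assumes a small undirected word graph with made-up similarity weights, two seed (labeled) words, and a simple update in which every unlabeled node repeatedly takes the weighted average of its neighbors' label scores while seed nodes stay fixed. The function name propagate_labels, the example words, and all weights are hypothetical.

```python
# Minimal sketch of iterative label propagation on a weighted, undirected graph.
# All node names, weights, and labels are illustrative assumptions.

def propagate_labels(edges, seeds, labels, num_iters=50):
    """Spread seed labels over the graph by repeated weighted neighbor averaging."""
    # Build a symmetric adjacency map: node -> {neighbor: weight}.
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w

    # Seed nodes put all mass on their known label; others start uniform.
    scores = {}
    for node in neighbors:
        if node in seeds:
            scores[node] = {lab: float(lab == seeds[node]) for lab in labels}
        else:
            scores[node] = {lab: 1.0 / len(labels) for lab in labels}

    for _ in range(num_iters):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = scores[node]  # keep labeled nodes clamped
                continue
            total = sum(nbrs.values())
            updated[node] = {
                lab: sum(w * scores[n][lab] for n, w in nbrs.items()) / total
                for lab in labels
            }
        scores = updated
    return scores


# Toy word-similarity graph: "run" and "dog" are labeled, "jog" and "cat" are not.
edges = {("run", "jog"): 0.9, ("jog", "cat"): 0.2, ("cat", "dog"): 0.8}
seeds = {"run": "VERB", "dog": "NOUN"}
labels = ["NOUN", "VERB"]

result = propagate_labels(edges, seeds, labels)
for word in ("jog", "cat"):
    best = max(result[word], key=result[word].get)
    print(word, best)
```

In this toy graph, each unlabeled word ends up closest to the seed it is most strongly connected to. Real graph-based SSL methods typically add regularization terms and operate on far larger graphs.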

OUTLINE

  Introduction
    • Why graph-based SSL methods?
    • Graph construction from linguistic data

  Graph-based SSL methods
    • Regularization-based methods

  Scaling to large data

  Applications in NLP problems
    • PoS Tagging
    • Bilingual Projection
    • Semantic Parsing
    • Text Categorization
    • Information Acquisition

  Conclusion
    • Open problems

PRESENTER BIOS

• Amarnag Subramanya is a Senior Research Scientist in Machine Learning and Natural Language Processing at Google Research. Amarnag received his PhD (2009) from the University of Washington, Seattle, working under the supervision of Jeff Bilmes. His research interests include machine learning and graphical models. In particular, he is interested in the application of semi-supervised learning to large-scale problems in natural language processing. His dissertation focused on improving the performance and scalability of graph-based semi-supervised learning algorithms for problems in natural language, speech, and vision. He was the recipient of the Microsoft Research Graduate Fellowship in 2007. He recently co-organized a session on "Semantic Processing" at the National Academy of Engineering's (NAE) Frontiers of Engineering (USFOE) conference.

Amarnag Subramanya
Google Research
1600 Amphitheatre Pkwy.
Mountain View, CA 94043
Email: asubram-AT-google.com
Web: http://sites.google.com/site/amarsubramanya

• Partha Pratim Talukdar is a Postdoctoral Fellow in the Machine Learning Department at Carnegie Mellon University, working with Tom Mitchell on the NELL project. Partha received his PhD (2010) in CIS from the University of Pennsylvania, working under the supervision of Fernando Pereira, Zack Ives, and Mark Liberman. Partha is broadly interested in Machine Learning, Natural Language Processing, and Data Integration, with particular interest in large-scale learning and inference over graphs. His dissertation introduced novel graph-based weakly-supervised methods for Information Extraction and Integration. His past industrial research affiliations include HP Labs, Google Research, and Microsoft Research. Partha is a co-organizer of the NAACL-HLT 2012 workshop on web-scale knowledge extraction from text (AKBC-WEKEX 2012), and an Area Co-Chair for EMNLP-CoNLL 2012.

Partha Pratim Talukdar
GHC 8133, Machine Learning Department
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
Email: partha.talukdar-AT-cs.cmu.edu
Web: http://www.talukdar.net