The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Held at the Portland Marriott Downtown Waterfront in Portland, Oregon, USA, June 19-24, 2011


Automatic Summarization

PRESENTERS: Ani Nenkova, Sameer Maskey, Yang Liu

ABSTRACT:

In the past decade, the amount of digital data, such as news, scientific articles, blogs, and conversations, has grown at an exponential pace. The need to address "information overload" by developing automatic summarization systems has never been more pressing. At the same time, approaches and algorithms for summarization have matured and increased in complexity, and interest in summarization research has intensified, with numerous publications on the topic each year. A newcomer to the field may find navigating the existing literature to be a daunting task. In this tutorial, we aim to give a systematic overview of traditional and more recent approaches for text and speech summarization.

A core problem in summarization research is devising methods to estimate the importance of a unit, be it a word, clause, sentence or utterance, in the input. A few classical methods will be introduced, but the overall emphasis will be on the most recent advances. We will cover the log-likelihood ratio test for topic word discovery and graph-based models for sentence importance, and will discuss semantically rich approaches based on latent semantic analysis and lexical resources. We will then turn to the most recent Bayesian models of summarization. For supervised machine learning approaches, we will discuss the suite of traditional features used in summarization, as well as issues with data annotation and acquisition.

Ultimately, the summary will be a collection of important units. The summary can be selected in a greedy manner, choosing the most informative sentences one by one, or the units can be selected jointly and optimized for informativeness. We discuss both approaches, with emphasis on recent optimization work. In the part on evaluation, we will discuss the standard manual and automatic evaluation metrics, as well as very recent work on fully automatic evaluation.

We then turn to domain-specific summarization, particularly summarization of scientific articles and speech data (telephone conversations, broadcast news, meetings and lectures). In speech, the acoustic signal carries additional information that can be exploited as features in summarization, but it also poses unique problems, which we discuss, related to disfluencies, the lack of sentence or clause boundaries, and recognition errors. We will only briefly touch on key but under-researched issues of linguistic quality of summaries, deeper semantic analysis for summarization, and abstractive summarization. Illustrative sketches of a few of the core techniques follow the outline below.

OUTLINE:

1. Computing informativeness
   (a) Frequency-driven: topic words, clustering, graph approaches
   (b) Semantic approaches: lexical chains, latent semantic analysis
   (c) Probabilistic (Bayesian) models
   (d) Supervised approaches
2. Optimizing informativeness and minimizing redundancy
   (a) Maximal marginal relevance
   (b) Integer linear programming
   (c) Redundancy removal
3. Evaluation
   (a) Manual evaluation: Responsiveness and Pyramid
   (b) Automatic: ROUGE
   (c) Fully automatic
4. Domain-specific summarization
   (a) Scientific articles
   (b) Biographical
   (c) Speech summarization
       (i) Utterance segmentation
       (ii) Acoustic features
       (iii) Dealing with recognition errors
       (iv) Disfluency removal and compression
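ILLUSTRATIVE SKETCHES:

The sketches below are minimal illustrations of a few of the techniques named above; they are simplified examples written for this description, not the presenters' implementations. The first sketch shows the log-likelihood ratio test for topic word discovery (outline item 1(a)), assuming plain word-count dictionaries for the input and a large background corpus; the threshold 10.83 corresponds to a chi-square cutoff at roughly the 0.001 significance level.

    import math

    def log_likelihood_ratio(k1, n1, k2, n2):
        """Dunning-style log-likelihood ratio statistic for the hypothesis that
        a word is equally likely in the input (k1 of n1 tokens) and in a
        background corpus (k2 of n2 tokens)."""
        def ll(k, n, p):
            # binomial log-likelihood; terms with a zero count are dropped
            return ((k * math.log(p) if k else 0.0)
                    + ((n - k) * math.log(1 - p) if n - k else 0.0))
        p = (k1 + k2) / (n1 + n2)  # pooled probability under the null hypothesis
        return 2 * (ll(k1, n1, k1 / n1) + ll(k2, n2, k2 / n2)
                    - ll(k1, n1, p) - ll(k2, n2, p))

    def topic_words(input_counts, background_counts, threshold=10.83):
        """Words whose statistic exceeds the cutoff are treated as topic words."""
        n1 = sum(input_counts.values())
        n2 = sum(background_counts.values())
        return {w for w, k in input_counts.items()
                if log_likelihood_ratio(k, n1, background_counts.get(w, 0), n2) > threshold}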
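A graph-based importance model can be sketched as power iteration over a sentence similarity graph, in the spirit of LexRank or TextRank. The similarity matrix and damping factor below are illustrative assumptions; any pairwise sentence similarity (for example, cosine similarity over tf-idf vectors) can be plugged in.

    def graph_centrality(sim, damping=0.85, iterations=50):
        """Power iteration over a sentence similarity graph: sentences similar
        to many other central sentences receive high scores.
        sim is a square list-of-lists matrix of pairwise similarities."""
        n = len(sim)
        # row-normalize the similarities into transition probabilities
        row_sums = [sum(row) or 1.0 for row in sim]
        trans = [[sim[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        scores = [1.0 / n] * n
        for _ in range(iterations):
            scores = [(1 - damping) / n
                      + damping * sum(trans[j][i] * scores[j] for j in range(n))
                      for i in range(n)]
        return scores

Sentences are then ranked by score and the top-ranked ones are considered for inclusion in the summary.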
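Greedy selection that trades informativeness against redundancy is commonly instantiated as maximal marginal relevance (outline item 2(a)). In this sketch, relevance and similarity are placeholder callables, and budget is a sentence count rather than a word limit; these names are assumptions made for illustration only.

    def mmr_select(sentences, relevance, similarity, budget, lam=0.7):
        """Greedy maximal marginal relevance selection.
        relevance(s): importance score of a unit; similarity(a, b): redundancy
        score in [0, 1]; lam trades informativeness against redundancy."""
        selected = []
        candidates = list(sentences)
        while candidates and len(selected) < budget:
            def mmr_score(s):
                redundancy = max((similarity(s, t) for t in selected), default=0.0)
                return lam * relevance(s) - (1 - lam) * redundancy
            best = max(candidates, key=mmr_score)
            selected.append(best)
            candidates.remove(best)
        return selected

A lower lam favors diversity among the selected units, while lam close to 1 reduces to purely greedy selection by relevance.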
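Finally, automatic evaluation with ROUGE (outline item 3(b)) is, at its core, n-gram recall against one or more reference summaries. The sketch below is a simplified, hypothetical ROUGE-N style recall, not the official ROUGE toolkit, which adds options such as stemming, stopword removal and jackknifing.

    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def rouge_n_recall(system_tokens, reference_tokens_list, n=2):
        """Clipped n-gram overlap with the references, divided by the total
        number of reference n-grams."""
        sys_counts = Counter(ngrams(system_tokens, n))
        overlap, total = 0, 0
        for ref in reference_tokens_list:
            ref_counts = Counter(ngrams(ref, n))
            total += sum(ref_counts.values())
            overlap += sum(min(c, sys_counts[g]) for g, c in ref_counts.items())
        return overlap / total if total else 0.0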
PRESENTER BIOS:

Ani Nenkova (University of Pennsylvania)
330 Walnut St, UPenn, CIS, Levine 505, Philadelphia, PA 19104
Phone: 215-898-8745
Email: nenkova@seas.upenn.edu
Webpage: http://www.cis.upenn.edu/~nenkova

Ani Nenkova is an Assistant Professor of Computer and Information Science at the University of Pennsylvania. She has worked extensively in the area of text summarization and its evaluation. She has recently developed fully automatic methods for evaluating both linguistic quality and content selection in summarization.

Sameer Maskey (IBM Research)
1101 Kitchawan Road, IBM, Yorktown Heights, New York, 10562
Phone: 914-945-1573
Email: smaskey@us.ibm.com
Webpage: http://www.cs.columbia.edu/~smaskey

Sameer Maskey is a Research Staff Member at IBM Research in Yorktown Heights, New York. His main research interests are statistical techniques for natural language and speech processing, particularly machine translation and summarization of spoken documents. He has previously worked on other topics such as information extraction, speech synthesis and question answering.

Yang Liu (University of Texas at Dallas)
800 W. Campbell Rd., MS EC 31, The University of Texas at Dallas, Richardson, TX 75080, USA
Phone: 972-883-6618
Email: yangl@hlt.utdallas.edu
Webpage: http://www.hlt.utdallas.edu/~yangl

Yang Liu is an Assistant Professor of Computer Science at the University of Texas at Dallas. Her research interests span a broad range of topics in speech and language processing, including summarization, spoken language understanding, prosody modeling in speech, emotion recognition, NLP for informal domains, and the use of speech and language technology to detect communication disorders.


