The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Held at the Portland Marriott Downtown Waterfront in
Portland, Oregon, USA, June 19-24, 2011
Automatic Summarization
PRESENTERS: Ani Nenkova, Sameer Maskey, Yang Liu
ABSTRACT:
In the past decade, the amount of digital data, such as news,
scientific articles, blogs, and conversations, has grown
at an exponential pace. The need to address `information overload'
by developing automatic summarization systems has never been more
pressing. At the same time, approaches and algorithms for summarization
have matured and increased in complexity, and interest in summarization
research has intensified, with numerous publications on the topic each year.
A newcomer to the field may find navigating the existing literature to be
a daunting task. In this tutorial, we aim to give a systematic overview
of traditional and more recent approaches for text and speech summarization.
A core problem in summarization research is devising methods to estimate
the importance of a unit, be it a word, clause, sentence or utterance,
in the input. A few classical methods will be introduced, but the overall
emphasis will be on the most recent advances. We will cover the log-likelihood
test for topic word discovery and graph-based models for sentence importance,
and will discuss semantically rich approaches based on latent semantic
analysis and lexical resources. We will then turn to the most recent Bayesian
models of summarization. For supervised machine learning approaches,
we will discuss the suite of traditional features used in summarization,
as well as issues with data annotation and acquisition.
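As a rough illustration of the log-likelihood test for topic word discovery mentioned above, the sketch below compares a word's frequency in the input with its frequency in a background corpus, using a binomial log-likelihood ratio. The function names, the toy interface (plain token lists), and the cutoff of 10.83 (the chi-square critical value at p = 0.001 with one degree of freedom) are assumptions made for this example, not details taken from the tutorial.

```python
import math
from collections import Counter


def _log_binomial(k, n, p):
    """Log-likelihood of k successes in n trials (binomial coefficient omitted)."""
    eps = 1e-12  # clamp to avoid log(0) when a word is absent from one corpus
    p = min(max(p, eps), 1.0 - eps)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def topic_words(input_tokens, background_tokens, threshold=10.83):
    """Return {word: llr} for words significantly more frequent in the input.

    threshold is an assumed cutoff, not a value prescribed by the tutorial.
    """
    inp, bg = Counter(input_tokens), Counter(background_tokens)
    n1, n2 = len(input_tokens), len(background_tokens)
    result = {}
    for word, k1 in inp.items():
        k2 = bg.get(word, 0)
        p1, p2 = k1 / n1, k2 / n2
        p = (k1 + k2) / (n1 + n2)  # shared rate under the null hypothesis
        if p1 <= p2:
            continue  # keep only words over-represented in the input
        llr = 2.0 * (
            _log_binomial(k1, n1, p1) + _log_binomial(k2, n2, p2)
            - _log_binomial(k1, n1, p) - _log_binomial(k2, n2, p)
        )
        if llr > threshold:
            result[word] = llr
    return result


doc = ["summarization"] * 20 + ["the"] * 20
background = ["the"] * 500 + ["of"] * 500
print(sorted(topic_words(doc, background)))
```

In this toy input, "the" is equally frequent in both collections and is therefore not reported, while "summarization" is.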
Ultimately, the summary will be a collection of important units.
The units can be selected greedily, choosing the most informative
sentence one at a time, or they can be selected jointly and optimized
for overall informativeness. We discuss both approaches,
with emphasis on recent optimization work.
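The greedy strategy described above can be sketched as follows. This is a minimal example in the spirit of maximal marginal relevance, assuming that importance scores are already available for each sentence and using word-overlap (Jaccard) similarity as a simple stand-in for redundancy. The names `greedy_select`, `budget`, and `lam` are hypothetical and chosen only for this illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_select(sentences, scores, budget, lam=0.7):
    """Pick sentence indices one at a time under a word budget.

    sentences: list of token lists; scores: one importance score per sentence.
    Each step takes the sentence with the best trade-off between importance
    and similarity to what has already been selected.
    """
    selected, length = [], 0
    remaining = set(range(len(sentences)))
    while remaining:
        best, best_val = None, float("-inf")
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            if length + len(sentences[i]) > budget:
                continue  # sentence would not fit in the summary
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            val = lam * scores[i] - (1.0 - lam) * redundancy
            if val > best_val:
                best, best_val = i, val
        if best is None:
            break  # nothing else fits
        selected.append(best)
        remaining.remove(best)
        length += len(sentences[best])
    return selected


sents = [
    "the cat sat on the mat".split(),
    "the cat sat on a mat".split(),
    "dogs bark loudly".split(),
]
print(greedy_select(sents, [1.0, 0.9, 0.5], budget=12, lam=0.5))
```

With these toy inputs, the second sentence has a high score but overlaps heavily with the first, so the lower-scored third sentence is preferred. Joint selection, for example with integer linear programming, would instead optimize over all subsets that fit the budget at once.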
In the part on evaluation, we will discuss the standard manual and
automatic evaluation metrics, as well as very recent work on fully
automatic evaluation.
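For readers unfamiliar with automatic metrics, the following sketch computes an n-gram recall score in the style of ROUGE-N against a single reference summary. It is a simplified illustration: the official ROUGE toolkit adds options such as stemming, stopword removal, and multiple references, none of which are modeled here, and the function names are assumptions for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate.

    Matches are clipped: an n-gram counts at most as often as it occurs
    in the candidate.
    """
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


candidate = "the cat sat on the mat".split()
reference = "the cat was on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams matched
```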
We then turn to domain-specific summarization, particularly summarization
of scientific articles and speech data (telephone conversations,
broadcast news, meetings and lectures). In speech, the acoustic signal
carries additional information that can be exploited as features in
summarization, but it also poses unique problems, which we discuss:
disfluencies, lack of sentence or clause boundaries, and recognition errors.
We will only briefly touch on key but under-researched issues of linguistic
quality of summaries, deeper semantic analysis for summarization,
and abstractive summarization.
OUTLINE:
1. Computing informativeness
(a) Frequency-driven: topic words, clustering, graph approaches
(b) Semantic approaches: lexical chains, latent semantic analysis
(c) Probabilistic (Bayesian) models
(d) Supervised approaches
2. Optimizing informativeness and minimizing redundancy
(a) Maximal marginal relevance
(b) Integer linear programming
(c) Redundancy removal
3. Evaluation
(a) Manual evaluation: Responsiveness and Pyramid
(b) Automatic: ROUGE
(c) Fully automatic
4. Domain-specific summarization
(a) Scientific articles
(b) Biographical
(c) Speech summarization
(i) Utterance segmentation
(ii) Acoustic features
(iii) Dealing with recognition errors
(iv) Disfluency removal and compression
PRESENTER BIOS
Ani Nenkova (Univ. of Pennsylvania)
330 Walnut St
UPenn, CIS, Levine 505
Philadelphia, PA 19104
Phone: 215-898-8745
Email: nenkova@seas.upenn.edu
Webpage: http://www.cis.upenn.edu/~nenkova
Ani Nenkova is an Assistant Professor of Computer and Information
Science at the University of Pennsylvania. She has worked extensively
in the area of text summarization and evaluation of text
summarization. She has recently developed fully automatic
methods for evaluation of both linguistic quality and content
selection in summarization.
Sameer Maskey (IBM Research)
1101 Kitchawan Road
IBM, Yorktown Heights
New York, 10562
Phone: 914-945-1573
Email: smaskey@us.ibm.com
Webpage: http://www.cs.columbia.edu/~smaskey
Sameer Maskey is a Research Staff Member at IBM Research in Yorktown Heights,
New York. His main research interests are statistical techniques for
Natural Language and Speech processing, particularly
Machine Translation and Summarization of spoken documents. He has
previously worked on other topics such as Information Extraction,
Speech Synthesis and Question Answering.
Yang Liu (Univ. of Texas at Dallas)
800 W. Campbell Rd., MS EC 31
The University of Texas at Dallas
Richardson, TX 75080, USA
Phone: 972-883-6618
Email: yangl@hlt.utdallas.edu
Webpage: http://www.hlt.utdallas.edu/~yangl
Yang Liu is an Assistant Professor of Computer Science at the University
of Texas at Dallas. Her research interests are in a broad range of topics
in speech and language processing, including summarization, spoken language
understanding, prosody modeling in speech, emotion recognition, NLP for
informal domains, and using speech and language technology for detection
of communication disorders.