We study the use of temporal information in the form of timelines to enhance multi-document summarization. We employ a fully automated temporal processing system to generate a timeline for each input document. We derive three features from these timelines, and show that their use in supervised summarization leads to a significant 4.1% improvement in ROUGE performance over a state-of-the-art baseline. In addition, we propose TimeMMR, a modification to Maximal Marginal Relevance that promotes temporal diversity by way of computing time span similarity, and show its utility in summarizing certain document sets. We also propose a filtering metric to discard noisy timelines generated by our automatic processes, to purify the timeline input for summarization. By selectively using timelines guided by filtering, overall summarization performance is increased by a significant 5.9%.
There has been a good amount of research invested into improving the temporal interpretation of text. Besides the increasing availability of annotation standards (e.g., TimeML [21]) and corpora (e.g., TIDES [9], TimeBank [22]), the community has also organized three successful evaluation workshops — TempEval-1 [25], -2 [26], and -3 [24]. As the state-of-the-art improves, these workshops have moved away from the piecemeal evaluation of individual temporal processing tasks and towards the evaluation of complete end-to-end systems in TempEval-3. We believe our understanding of the temporal information found in text is sufficiently robust, and that there is an opportunity to now leverage this information in downstream applications. In this paper, we present our work in incorporating the use of such temporal information into multi-document summarization.
The goal of multi-document summarization is to generate a summary which includes the main points from an input collection of documents with minimal repetition of similar points. We hope to improve the quality of the summaries that are generated by considering temporal information found in the input text. To motivate how temporal information can be useful in summarization, let us refer to Figure 1. The three sentences describe a recent cyclone and a previous one which happened in 1991. Recognizing that sentence (3) is about a storm that had happened in the past is important when writing a summary about the recent storm, as it is not relevant and can likely be excluded.
(1) A fierce cyclone packing extreme winds and torrential rain smashed into Bangladesh’s southwestern coast Thursday, wiping out homes and trees in what officials described as the worst storm in years. (2) More than 100,000 coastal villagers have been evacuated before the cyclone made landfall. (3) The storm matched one in 1991 that sparked a tidal wave that killed an estimated 138,000 people, Karmakar told AFP.
It is reasonable to expect that a collection of documents about the recent storm will contain more references to it, compared with the earlier one that happened in 1991. Visualized on a timeline, this will translate to more events (bolded in Figure 1) around the time when the recent storm occurred. There should be fewer events mentioned in the collection for the earlier 1991 time period. Figure 2 illustrates a possible timeline laid out with the events found in Figure 1. The events from the more recent storm are found together at the same time. There are fewer events which talk about the previous storm. Thus, temporal information does assist in identifying which sentences are more relevant to the final summary.
Our work is significant as it addresses an important gap in the exploitation of temporal information. While there has been prior work making use of temporal information for multi-document summarization, it has 1) largely been confined to helping chronologically order content within summaries [2], or 2) focused only on the use of recency as an indicator of saliency [10, 27]. In this work we construct timelines (as a representation of temporal information) automatically and incorporate them into a state-of-the-art multi-document summarization system. This is achieved with 1) three novel features derived from timelines to help measure the saliency of sentences, as well as 2) TimeMMR, a modification to the traditional Maximal Marginal Relevance (MMR) [3]. TimeMMR promotes diversity by additionally considering temporal information instead of just lexical similarities. Through these, we demonstrate that temporal information is useful for multi-document summarization. Compared to a competitive baseline, significant improvements of up to 4.1% are obtained.
Automatic temporal processing systems are not perfect yet, and this may have an impact on their use for downstream applications. This work additionally proposes the use of the lengths of timelines as a metric to gauge the usefulness of timelines. Together with the earlier described contributions, this metric further improves summarization, yielding an overall 5.9% performance increase.
Barzilay et al. (1999) were one of the first to use time for multi-document summarization. They recognized the importance of generating a summary which presents the time perspective of the summarized documents correctly. They estimated the chronological ordering of events with a small set of heuristics, and also made use of lexical patterns to perform basic time normalization on terms like “today” relative to the document creation time. The induced ordering is used to present the selected summary content, following the chronological order in the original documents.
In another line of work, Goldstein et al. (2000) made use of the temporal ordering of documents to be summarized. In computing the relevance of a passage for inclusion into the final summary, they considered the recency of the passage’s source document. Passages from more recent documents are deemed to be more important. Wan (2007) and Demartini et al. (2010) made similar assumptions in their work on TimedTextRank and entity summarization, respectively.
Instead of just considering the notion of recency, Liu et al. (2009) proposed an interesting approach using a temporal graph. Events within a document set correspond to vertices in their proposed graph, while edges are determined by the temporal ordering of events. From the resulting weakly-connected graph, the largest forests are assumed to contain key topics within the document set and used to influence a scoring mechanism which prefers sentences touching on these topics.
Wu (2008) also made use of the relative ordering of events. He assigned complete timestamps to events extracted from text. After laying out these events onto a timeline by making use of these timestamps, the number of events that happen within the same day is used to influence sentence scoring. The motivation behind this approach is that days which have a large number of events should be more important and more worthy of reporting than others.
These prior works target either 1) sentence re-ordering, or 2) the use of recency as an indicator of saliency. In sentence re-ordering, final summaries are re-arranged so that the extracted sentences that form the summary are in a chronological order. We argue that this may not be appropriate for all summaries. Depending on the style of writing or journalistic guidelines, a summary can arguably be written in a number of ways. The use of recency as an indicator of saliency is useful, yet disregards other accessible temporal information. If a summary of a whole sequence of events is desired, recency becomes less useful.
The work of Wu (2008) is closely related to one of the features proposed in this paper. He had also made use of temporal information to weight sentences to generate summaries. However his approach is guided by the number of events happening within the same time span, and relies on event co-referencing. In this work, we have simplified this idea by dropping the need for event co-referencing (removing a source of propagated error), and augmented it with two additional features derived from timelines. By doing so, we are able to make better use of the available temporal information, taking into account all known events and the time in which they occur.
A useful note here is that this work is arguably different from the Temporal Summarization (TmpSum) track at the Text Retrieval Conference [1]. Given a large stream of data in real-time, the purpose of the TmpSum track is to look out for a query event, and retrieve specific details about the event over a period of time. Systems are also expected to identify the source sentences from which these details are retrieved. This is not the same as our approach here, which makes use of temporal information encoded in timelines to generate prose summaries.
To incorporate temporal information into multi-document summarization, we adopt the workflow in Figure 3, which has two key processes: 1) temporal processing, and 2) summarization.
Temporal Processing generates timelines from text, one for each input document. Timelines are well-understood constructs which have often been used to represent temporal information [7, 8]. They indicate the temporal relationships between two basic temporal units: 1) events, and 2) time expressions (or timexes for short). In this work, we adopt the definitions proposed in the standardized TimeML annotation [21]. An event refers to an eventuality, a situation that occurs or an action; while a timex is a reference to a particular date or time (e.g. “2013 December 31”).
Following the “divide-and-conquer” approach described in Verhagen et al. (2010), the results from three temporal processing steps, namely 1) timex normalization, 2) event-timex temporal relationship classification, and 3) event-event temporal relationship classification, are merged to obtain timelines (top half of Figure 3). We rely on existing systems for each of these steps [18, 23, 17].
Summarization. We make use of a state-of-the-art summarization system, SWING [16] (bottom half of Figure 3). SWING is a supervised, extractive summarization system which ranks sentences based on scores computed using a set of features in the Sentence Scoring phase. The Maximal Marginal Relevance (MMR) algorithm is then used in the Sentence Re-ordering phase to re-order and select sentences to form the final summary. The timelines built in the earlier temporal processing can be incorporated into this pipeline by deriving a set of features used to score sentences in Sentence Scoring, and as input to the MMR algorithm when computing similarity in Sentence Re-ordering.
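As a rough illustration of the integration point, the sketch below shows a generic feature-based sentence scorer; the feature names, weights, and simple linear combination are placeholders of ours and do not reflect SWING's actual learned model.

```python
from typing import Callable, Dict

# A feature maps a candidate sentence to a real-valued score.
FeatureFn = Callable[[str], float]

def score_sentence(sentence: str,
                   features: Dict[str, FeatureFn],
                   weights: Dict[str, float]) -> float:
    """Illustrative weighted combination of per-sentence feature scores."""
    return sum(weights[name] * fn(sentence) for name, fn in features.items())

# Timeline-derived features (TSI, CTSI and TCD, defined below) would simply be
# registered alongside SWING's existing features, e.g.:
#   features = {"sentence_position": ..., "sentence_length": ..., "TSI": ..., ...}
```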
A typical timeline used in this work has been shown earlier in Figure 2. The arrowed, horizontal axis is the timeline itself. The timeline can be viewed as a continuum of time, with points on the timeline referring to specific moments of time. Small solid blocks on the timeline itself are references to absolute timestamps along the timeline (e.g., “2013-Feb-13 11:32 +0000” in the figure).
The black square boxes above the timeline denote events. Events can either occur at a specific instant of time (e.g., an explosion), or over a period of time (e.g., a football match). Generalizing, we refer to the time period an event takes place in as its time span (vertical dotted lines). As a simplifying assumption, events are laid out on the timeline based on the starting time of their time span. Note that in our work, time spans may not correspond to specific instants of time, but instead help in inferring an ordering of events. Events which appear to the left of others take place earlier, while events within the same time span happen together over the same time period.
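To make this representation concrete, the sketch below shows one possible in-memory encoding of events, time spans, and timelines; the class and field names are illustrative assumptions of ours, not the output format of the temporal processing systems cited above (which produce TimeML-style annotations). The feature sketches later in this section build on these structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Event:
    """A TimeML-style event mention."""
    word: str                          # event trigger word as it appears in the sentence
    sentence_id: int                   # index of the sentence containing the event
    span: Optional["TimeSpan"] = None  # time span the event is laid out in

@dataclass
class TimeSpan:
    """A (possibly underspecified) period of time on the timeline."""
    index: int                         # left-to-right position of the span on the timeline
    start: Optional[str] = None        # normalized timex anchoring the span, if known
    events: List[Event] = field(default_factory=list)

@dataclass
class Timeline:
    """One timeline per input document: an ordered sequence of time spans."""
    spans: List[TimeSpan]

    def __len__(self) -> int:
        # The "length" of a timeline, used later for reliability filtering,
        # is taken here to be the total number of events it contains.
        return sum(len(ts.events) for ts in self.spans)
```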
We derive three features from the constructed timelines, which are then used for downstream Sentence Scoring. Figure 4 shows a simplified timeline, along with annotations that are referenced in this section to help explain how these timeline features are derived.
1. Time Span Importance (TSI). We hypothesize that when more events happen within a particular time span, that time span is potentially more relevant for summarization. Sentences that mention events found in such a time span should be assigned higher scores. Referring to Figure 1, whose timeline is shown in Figure 2, we see that the time span with the largest number of events is when the latest cyclone made landfall. Assigning higher scores to sentences which contain events in this time span will help us to select more relevant sentences if we want a summary about the cyclone.
Let $t^{*}$ be the time span with the largest number of events in a timeline. The importance of a time span $t$ is computed by normalizing the number of events in $t$ against the number of events in $t^{*}$. The TSI of a sentence $S$ is then the sum of the time span importance associated with all the words in $S$:

$$\mathrm{TSI}(S) = \sum_{w \in S} \frac{|E(ts(w))|}{|E(t^{*})|} \tag{1}$$

where $ts(w)$ denotes the time span which a word $w$ is associated with, and $|E(t)|$ is the number of events within time span $t$.
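A minimal sketch of how Eq. (1) could be computed over the timeline structures sketched earlier; it assumes that only event words carry a time span, so all other words contribute zero.

```python
def time_span_importance(span: TimeSpan, timeline: Timeline) -> float:
    """Importance of a time span: its event count normalized by that of the busiest span."""
    max_events = max((len(ts.events) for ts in timeline.spans), default=0)
    return len(span.events) / max_events if max_events else 0.0

def tsi(sentence_events: List[Event], timeline: Timeline) -> float:
    """TSI of a sentence (Eq. 1): summed importance of the time spans of its event words."""
    return sum(time_span_importance(e.span, timeline) for e in sentence_events)
```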
2. Contextual Time Span Importance (CTSI). The importance of a time span may not depend solely on the number of events that happen within it. If it is near time spans which are “important” (i.e., ones with a large number of events), it should also be of relative importance. A more concrete illustration of this can also be seen in Figure 1. Sentence (2) explains that a lot of people were evacuated prior to the cyclone making landfall. It is imaginable that this can be useful information to include in a summary, even though from the corresponding timeline in Figure 2, the “evacuated” event falls in a time span with a low importance score (i.e., the time span only has one event). CTSI seeks to promote sentences such as this.
We derive the CTSI of a sentence by first computing the contextual importance of the words in the sentence. We define the contextual importance of a word $w$ found in time span $ts(w)$ as a weighted sum of the time span importance of the two nearest peaks found to the left and right of $ts(w)$, respectively. In Figure 4, taking reference from the event shaded in black, the nearest peak to the left of its time span and the nearest peak to its right are marked on the timeline. The contribution of each peak to the weighted sum is decayed by its distance from $ts(w)$. Formally, the contextual time span importance of a word $w$ can be expressed as:

$$ctsi(w) = \lambda \cdot \frac{I_{L}}{d_{L}} + (1 - \lambda) \cdot \frac{I_{R}}{d_{R}} \tag{2}$$

where $ts(w)$ is the time span associated with $w$, $I_{L}$ and $I_{R}$ are the time span importance of the peaks to the left and right of $ts(w)$ respectively, while $d_{L}$ and $d_{R}$ are the number of time spans between $ts(w)$ and its left and right peaks respectively. $\lambda$ balances the importance of the left and right peaks, and is intuitively set to 0.5. The CTSI of a sentence $S$ is computed as:

$$\mathrm{CTSI}(S) = \sum_{w \in E_{S}} ctsi(w) \tag{3}$$

where $E_{S}$ denotes the set of event words in $S$.
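The following sketch mirrors Eqs. (2) and (3) under stated assumptions, building on the earlier sketches: a “peak” is taken to be a local maximum of the per-span event counts, the distance decay is a simple division by the number of intervening spans, and $\lambda$ is fixed at 0.5.

```python
LAMBDA = 0.5  # balance between the left and right peaks (assumed value)

def is_peak(counts: List[int], i: int) -> bool:
    """A span is treated as a 'peak' if its event count is a local maximum."""
    left_ok = i == 0 or counts[i] >= counts[i - 1]
    right_ok = i == len(counts) - 1 or counts[i] >= counts[i + 1]
    return left_ok and right_ok

def nearest_peak(timeline: Timeline, idx: int, step: int) -> Optional[int]:
    """Index of the nearest peak strictly to the left (step=-1) or right (step=+1) of span idx."""
    counts = [len(ts.events) for ts in timeline.spans]
    i = idx + step
    while 0 <= i < len(counts):
        if is_peak(counts, i):
            return i
        i += step
    return None

def ctsi_word(event: Event, timeline: Timeline) -> float:
    """Contextual importance of an event word (Eq. 2): importance of its two nearest
    peaks, each decayed by its distance (in time spans) from the word's own span.
    Assumes span.index equals the span's position in timeline.spans."""
    idx = event.span.index
    score = 0.0
    for step, weight in ((-1, LAMBDA), (+1, 1.0 - LAMBDA)):
        peak = nearest_peak(timeline, idx, step)
        if peak is not None:
            distance = abs(peak - idx)
            score += weight * time_span_importance(timeline.spans[peak], timeline) / max(distance, 1)
    return score

def ctsi(sentence_events: List[Event], timeline: Timeline) -> float:
    """CTSI of a sentence (Eq. 3): summed contextual importance of its event words."""
    return sum(ctsi_word(e, timeline) for e in sentence_events)
```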
3. Sentence Temporal Coverage Density (TCD). We first define the temporal coverage of a sentence. This corresponds to the number of time spans that the events in a sentence talk about. Suppose a sentence contains events which are associated with time spans $t_{1}, t_{2}, \ldots, t_{k}$, ordered in the sequence they appear on the timeline. Then the temporal coverage of the sentence is defined as the number of time spans between the earliest time span $t_{1}$ and the latest time span $t_{k}$, inclusive. Referring to Figure 4, suppose a sentence contains the three events which have been shaded black. The temporal coverage in this case includes all the time spans from the earliest to the latest of their time spans, inclusive.
The constraint on the number of sentences that can be included in a summary requires us to select compact sentences which contain as many relevant facts as possible. Traditional lexical measures may attempt to achieve this by computing the ratio of keyphrases to the number of words in a sentence [11]. Stated equivalently, when two sentences are of the same length, if one contains more keyphrases, it should contain more useful facts. TCD parallels this idea with the use of temporal information, i.e. if two sentences are of the same temporal coverage, then the one with more events should carry more useful facts.
Formally, if a sentence $S$ contains events $e_{1}, \ldots, e_{n}$, where each event $e_{i}$ is associated with a time span $ts(e_{i})$, then $\mathrm{TCD}(S)$ is computed using:

$$\mathrm{TCD}(S) = \frac{n}{cov(S)} \tag{4}$$

where $n$ is the number of events found in $S$, and $cov(S)$ is the temporal coverage of $S$.
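A small sketch of Eq. (4) over the same assumed structures; temporal coverage is computed here from the span indices of a sentence's events.

```python
def tcd(sentence_events: List[Event]) -> float:
    """Temporal Coverage Density (Eq. 4): events per covered time span."""
    if not sentence_events:
        return 0.0
    positions = [e.span.index for e in sentence_events]
    coverage = max(positions) - min(positions) + 1  # spans from earliest to latest, inclusive
    return len(sentence_events) / coverage
```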
(1) An official in Barisal, 120 kilometres south of Dhaka, spoke of severe destruction as the 500 kilometre-wide mass of cloud passed overhead. (2) “Many trees have been uprooted and houses and schools blown away,” Mostofa Kamal, a district relief and rehabilitation officer, told AFP by telephone. (3) “Mud huts have been damaged and the roofs of several houses blown off,” said the state’s relief minister, Mortaza Hossain.
In the sentence re-ordering stage of the SWING pipeline, the iterative MMR algorithm is used to adjust the score of a candidate sentence $s$. In each iteration, $s$ is penalized if it is lexically similar to other sentences that have already been selected to form the eventual summary $\Sigma$. The motivating idea is to reduce repeated information by preferring sentences which bring in new facts.
Incorporating temporal information can potentially improve this. In Figure 5, the sentences describe many events which took place within the same time span. They describe the destruction caused by a hurricane with trees uprooted and buildings blown away. A summary about the hurricane need not contain all of these sentences as they are all describing the same thing. However it is not trivial for the lexically-motivated MMR algorithm to detect that events like “passed”, “uprooted” or “damaged” are in fact repetitive.
Thus, we propose further penalizing the score of $s$ if it contains events that happen in similar time spans as those contained in sentences within $\Sigma$. We refer to this as TimeMMR. Modifying the MMR equation from Ng et al. (2012):

$$Score_{MMR}(s) = \lambda \cdot Score(s) - (1 - \lambda) \cdot R(s, \Sigma) - TS(s, \Sigma) \tag{5}$$

where $Score(s)$ is the score of $s$, $\Sigma$ is the set of sentences already selected to be in the summary from previous iterations, and $R(s, \Sigma)$ is the predicted ROUGE-2 score of $s$ with respect to the already selected sentences ($\Sigma$). $\lambda$ is a weighting parameter which is empirically set to 0.9 after tuning over a development dataset. $TS(s, \Sigma)$ is the proportion of events in $s$ which happen in the same time span as another event in any other sentence in $\Sigma$. Two events are said to be in the same time span if one happens within the time period the other happens in. For example, an event that takes place in “2014 June” is said to take place within the year “2014”.
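A sketch of this re-scoring step, following the reconstruction of Eq. (5) above and the timeline structures sketched earlier; the prefix-matching test for time span containment assumes ISO-like normalized time values, and r2_predicted stands in for SWING's predicted ROUGE-2 similarity, which is not re-implemented here.

```python
LAMBDA_MMR = 0.9  # weighting parameter, set to 0.9 in the paper after development tuning

def same_time_span(a: Optional[str], b: Optional[str]) -> bool:
    """True if one normalized time value lies within the other, e.g. "2014-06" within "2014".
    This prefix test assumes ISO-like normalized values and is our simplification."""
    return bool(a) and bool(b) and (a.startswith(b) or b.startswith(a))

def ts_overlap(candidate_events: List[Event], summary_events: List[Event]) -> float:
    """TS(s, Sigma): proportion of events in the candidate sentence whose time span
    matches that of some event in the already selected sentences."""
    if not candidate_events:
        return 0.0
    matched = sum(
        any(same_time_span(e.span.start, o.span.start) for o in summary_events)
        for e in candidate_events
    )
    return matched / len(candidate_events)

def time_mmr(score: float, r2_predicted: float,
             candidate_events: List[Event], summary_events: List[Event]) -> float:
    """Adjusted score of a candidate sentence, following the reconstruction of Eq. (5)."""
    return (LAMBDA_MMR * score
            - (1 - LAMBDA_MMR) * r2_predicted
            - ts_overlap(candidate_events, summary_events))
```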
While TimeMMR is proposed here as an improvement over MMR, the premise is that incorporating temporal information can help to minimize redundancy in summaries. In future work, one could apply it to other state-of-the-art lexically-based approaches, including those of Hendrickx et al. (2009) and Celikyilmaz and Hakkani-Tur (2010). We also believe the same idea can be transplanted even to non-lexically motivated techniques, such as the corpus-based similarity measure proposed by Xie and Liu (2008). We chose to use MMR here as a proof-of-concept to demonstrate the viability of such a technique, and to easily integrate our work into SWING.
Temporal processing is imperfect. Together with the simplifying assumptions that were made in timeline construction, our generated timelines have errors which propagate into the summarization process. With this in mind, we selectively employ timelines to generate summaries only when we are confident of their accuracy. This can be done by computing a metric which can be used to decide whether or not timelines should be used for a particular input document collection. We refer to this as reliability filtering.
We postulate that the length of a timeline can serve as a simple reliability filtering metric. The intuition for this is that for longer timelines (which contain more events), possible errors are spread over the entire timeline, and do not overpower any useful signal that can be obtained from the timeline features outlined earlier. Errors are however very easily propagated into summary generation for shorter timelines, leading to less useful results.
We incorporate this into our process as follows: given an input document collection (which consists of 10 documents), the average length of the timelines of these 10 documents is computed. Only when this value is larger than a threshold are the timelines used.
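A minimal sketch of this filtering decision, assuming (as in the timeline structures above) that a timeline's length is its total event count; the threshold is supplied externally.

```python
def average_timeline_length(timelines: List[Timeline]) -> float:
    """Average length (total event count) over a group of timelines."""
    return sum(len(t) for t in timelines) / len(timelines) if timelines else 0.0

def use_timelines(collection_timelines: List[Timeline], threshold: float) -> bool:
    """Reliability filter: use temporal features for a document collection only when the
    average length of its per-document timelines exceeds the threshold."""
    return average_timeline_length(collection_timelines) > threshold

# In the experiments below, the threshold is itself set to the average timeline
# length over the entire input dataset:
#   threshold = average_timeline_length(all_timelines_in_dataset)
```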
The proposed timeline features and TimeMMR were implemented on top of SWING, and evaluated on the test documents from TAC-2011 [19]. SWING makes use of three generic features and two features targeted specifically at guided summarization. Since the focus of this paper is on multi-document summarization, we employ only the three generic features, i.e., 1) sentence position, 2) sentence length, and 3) interpolated n-gram document frequency in our experiments below. Summarization evaluation is done using ROUGE-2 (R-2) [13], as it has previously been shown to correlate well with human assessment [14] and is often used to evaluate automatic text summarization.
Table 1: R-2 results on TAC-2011 (Sig: statistical significance of the difference from the SWING baseline).

| | Configuration | R-2 | Sig |
|---|---|---|---|
| R | SWING | 0.1339 | NA |
| B1 | CLASSY | 0.1278 | - |
| B2 | POLYCOM | 0.1227 | ** |
| | Without Filtering | | |
| 1 | SWING+TSI+CTSI+TCD | 0.1394 | * |
| 2 | SWING+TSI+CTSI | 0.1372 | - |
| 3 | SWING+TSI+TCD | 0.1372 | - |
| 4 | SWING+CTSI+TCD | 0.1387 | * |
| 5 | SWING+TSI+CTSI+TCD+TMMR | 0.1389 | - |
| 6 | SWING+TSI+CTSI+TMMR | 0.1374 | - |
| 7 | SWING+TSI+TCD+TMMR | 0.1343 | - |
| 8 | SWING+CTSI+TCD+TMMR | 0.1363 | - |
| | With Filtering | | |
| 9 | SWING+TSI+CTSI+TCD | 0.1418 | ** |
| 10 | SWING+TSI+CTSI | 0.1378 | ** |
| 11 | SWING+TSI+TCD | 0.1389 | ** |
| 12 | SWING+CTSI+TCD | 0.1401 | ** |
| 13 | SWING+TSI+CTSI+TCD+TMMR | 0.1402 | ** |
| 14 | SWING+TSI+CTSI+TMMR | 0.1397 | ** |
| 15 | SWING+TSI+TCD+TMMR | 0.1376 | * |
| 16 | SWING+CTSI+TCD+TMMR | 0.1390 | ** |
The results obtained are shown in Table 1. In the table, each row refers to a specific summarization system configuration. We also show the results of two reference systems, CLASSY [5] and POLYCOM [30], as benchmarks. CLASSY and POLYCOM were among the top-performing systems at TAC-2011 (ranked 2nd and 3rd by R-2, respectively; the full version of SWING was ranked 1st with an R-2 score of 0.1380). From these results, we can see that SWING is a very competitive baseline.
Rows 9 to 16 additionally incorporate our timeline reliability filtering. We assume that the various input document sets to be summarized are available at the time of processing. Hence in these experiments, the threshold for filtering is set to be the average of all the timeline sizes over the whole input dataset. In a production environment where this assumption may not hold, this threshold could be set by empirical tuning over a development set.
Row 1 shows the usefulness of the proposed timeline-based features. A statistically significant improvement of 4.1% is obtained with the use of all three features over SWING. When we use reliability filtering (Row 9), this improvement increases to 5.9%.
The ablation test results in Rows 2 to 4 show a drop in R-2 each time a feature is left out. With the exception of Row 4, removing a feature reduces the improvement in R-2 over SWING to a level that is no longer statistically significant. The same drop occurs even when reliability filtering is used (Rows 9 to 12). These results indicate that all the proposed features are important and need to be used together to be effective.
Rows 5 to 8 and Rows 13 to 16 show the effect of TimeMMR. While the results do not uniformly show that TimeMMR is effective, it can be helpful, such as when comparing Rows 2 and 6, or Rows 10 and 14, where R-2 improves marginally.
Comparing Rows 1 to 8 against Rows 9 to 16, we see the importance of reliability filtering. It is able to guide the use of timelines such that significant improvements in R-2 over SWING are obtained.
To help visualize what the differences in these ROUGE scores mean, Figure 7 shows two summaries generated for document set D1117C of the TAC-2011 dataset (the summaries shown are truncated to fit within the 100-word limit imposed by the TAC-2011 guidelines). The left one is produced by the configuration in Row 9, and the right one is produced by SWING without the use of any temporal information.
Figure 7. Left: summary produced by the Row 9 configuration (with temporal information); right: summary produced by SWING without temporal information. Sentences labeled (nL) belong to the left summary and (nR) to the right; sentence (1) opens both.

(1) The Army’s surgeon general criticized stories in The Washington Post disclosing problems at Walter Reed Army Medical Center, saying the series unfairly characterized the living conditions and care for soldiers recuperating from wounds at the hospital’s facilities. (2L) Defense Secretary Robert Gates says people found to have been responsible for allowing substandard living conditions for soldier outpatients at Walter Reed Army Medical Center in Washington will be “held accountable,” although so far no one in the Army chain of command has offered to resign. (2R) A top Army general vowed to personally oversee the upgrading of Walter Reed Army Medical Center’s Building 18, a dilapidated former hotel that houses wounded soldiers as outpatients. (3L) Top Army officials visited Building 18, the decrepit former hotel housing more than 80 recovering soldiers, outside (3R) “I’m not sure it was an accurate representation,” Lt. Gen. Kevin Kiley, chief of the Army Medical Command which oversees Walter Reed and all Army health care, told reporters during a news conference. (4R) The Washington

Left summary R-2: 0.0873; right summary R-2: 0.0723
The higher R-2 score obtained by the summary on the left (0.0873) compared to the one on the right (0.0723) suggests that temporal information can help to identify salient sentences more accurately. A closer look at sentences (2L) and (2R) and their R-2 scores (0.0424 and 0.0249, respectively) is instructive. Figure 6 shows the raw feature scores of both sentences. Both sentences score similarly for the SWING features of sentence position (SP), sentence length (Length), and interpolated n-gram document frequency (INDF); however, the scores for all three timeline features are higher for (2L) than for (2R). This helps our time-sensitive system prefer (2L).
We now examine the proposed 1) timeline features, 2) TimeMMR algorithm, and 3) reliability filtering metric in greater detail to gain insight into their efficacy. For the analysis on timeline features, we only present an analysis for TSI and CTSI due to space constraints.
Time Span Importance. Figure 8 shows the last sentences from a pair of summaries generated with and without the use of TSI (all other sentences were the same). The original articles describe an accident where casualties were suffered when a crane toppled onto a building. It is easy to see why (1L) scores higher for R-2: it describes the cause of the accident just as it occurred. (1R), however, talks about events which happened before the accident itself (e.g., how much of the tower had already been erected). In this case, time span importance is able to correctly guide summary generation by favoring time spans containing events related to the actual toppling.
Figure 8. Last sentence of the summary generated with TSI (left; R-2: 0.1683) and of the summary generated without TSI (right; R-2: 0.1533).

(1L) A piece of steel fell and sheared off one of the ties holding it to the building, causing it to detach and topple, said Stephen Kaplan

(1R) About 19 of the 44 stories of the crane had been erected and it was to be extended when a piece of steel fell and sheared
Contextual Time Span Importance. CTSI recognizes that events which happen around the time of a big cluster of other events can be important too. The benefits of this feature can be clearly seen in Figure 9. The summary on the left achieved an R-2 score of 0.1215 while the one on the right achieved 0.0861. (2L) and (3L) were both boosted by the use of the contextual importance feature.
Figure 9. Left: summary generated with the contextual importance feature; right: summary generated without it. Sentences labeled (nL) belong to the left summary and (nR) to the right; sentence (1) opens both.

(1) Caribbean coral species essential to the region’s reef ecosystems are at risk of extinction as a result of climate change. (2L) But destructive fishing methods and over-harvesting have reduced worldwide catches by 90 percent in the past two decades. (2R) The Coral Reef Task Force, created in the Clinton administration, regularly assesses coral health. (3L) Scientists warn that up to half of the world’s coral reefs could disappear by 2045. (3R) With a finished necklace retailing for up to 20,000 dollars (15,000 euros), red corals are among the world’s most expensive wildlife commodities.

Left summary R-2: 0.1215; right summary R-2: 0.0861
Figure 10 shows an extract of the timeline generated for the source document from which (3L) is extracted. The two events inside (3L) fall in the two time spans marked in the figure. Their proximity to the peak between them gives the sentence a higher score for CTSI. This boost results in the sentence being selected for inclusion in the final summary. It turns out that this sentence appears verbatim in one of the model summaries for this document set, resulting in a very good R-2 score when contextual importance is used.
Is TimeMMR Useful? The experimental results do not conclusively affirm the usefulness of TimeMMR. However, we believe this is because the ROUGE measures used for evaluation are not suited for this purpose. Recall that TimeMMR seeks to eliminate redundancy based on time span similarities and not lexical likeness; ROUGE, however, measures the latter.
An interesting case in point is given in Figure 11. The summary on the left is generated using TimeMMR and achieves a lower ROUGE score. The one on the right is generated without TimeMMR and scores higher, suggesting that TimeMMR is not helpful. The key difference between the two summaries is (3R). (3L) is the equivalent of (4R), while (4L) is the full version of the truncated (5R). TimeMMR penalizes (3R), which reports that the shoe-throwing incident happened as U.S. President Bush appeared together with the Iraqi Prime Minister Nouri al-Maliki. However, their joint appearance is already reported in (1), which both summaries share, so (3R) repeats what had been presented earlier. Since (1) and (3R) talk about the same time span, TimeMMR down-weights (3R). We argue that this is better even though the ROUGE scores indicate otherwise. In future work it will be worthwhile to consider the use of metrics like Pyramid [20] which are less tied to surface lexical overlap.
Figure 11. Left: summary generated with TimeMMR; right: summary generated without TimeMMR. Sentences labeled (nL) belong to the left summary and (nR) to the right; sentences (1) and (2) open both.

(1) – An Iraqi reporter threw his shoes at visiting U.S. President George W. Bush and called him a “dog” in Arabic during a news conference with Iraqi Prime Minister Nuri al-Maliki in Baghdad (2) “All I can report is it is a size 10,” (3L) Muntadhar al-Zaidi, reporter of Baghdadiya television jumped and threw his two shoes one by one at the president, who ducked and thus narrowly missed being struck, raising chaos in the hall in Baghdad’s heavily fortified green Zone. (3R) The incident occurred as Bush was appearing with Iraqi Prime Minister Nouri al-Maliki. (4L) The president lowered his head and the first shoe hit the American and Iraqi flags behind the two leaders. (4R) Muntadhar al-Zaidi, reporter of Baghdadiya television jumped and threw his two shoes one by one at the president, who ducked and thus narrowly missed being struck, raising chaos in the hall in Baghdad’s heavily fortified green Zone. (5L) The (5R) The president lowered his head and the

Left summary R-2: 0.2643; right summary R-2: 0.2772
Table 2: Effect of varying the reliability filtering threshold T on R-2 (# denotes the number of document sets for which timelines are used).

| T | R-2 | Sig | # | T | R-2 | Sig | # |
|---|---|---|---|---|---|---|---|
| 0 | 0.1394 | * | 44 | 50 | 0.1386 | ** | 13 |
| 10 | 0.1382 | - | 43 | 60 | 0.1361 | * | 7 |
| 20 | 0.1377 | - | 41 | 70 | 0.1351 | - | 3 |
| 30 | 0.1393 | ** | 35 | 80 | 0.1351 | - | 2 |
| 40 | 0.1426 | ** | 22 | 90 | 0.1353 | - | 1 |
| 42.68 | 0.1418 | ** | 21 | 100 | 0.1339 | - | 0 |
Reliability Filtering. Table 2 shows the effect of varying the filtering threshold on R-2 for the best performing configuration from Table 1 (i.e., SWING+TSI+CTSI+TCD). The result obtained in Row 9 using a threshold of 42.68 is also reproduced for reference. T=0 means that timelines are used for all input document sets, whereas T=100 means that no timelines are used, as the length of the longest timeline is less than 100.
As the threshold increases from 0 to 40–50, summarization performance improves while the number of document sets where temporal information is used decreases. This suggests that filtering is successful in identifying timelines that are not sufficiently accurate to be useful for summarization. R-2 performance peaks around a threshold of 40. This affirms our use of the average length of timelines as the threshold value in our earlier experiments. Beyond 60, the R-2 scores are still higher than that obtained by SWING, but no longer significantly so. At these higher thresholds, temporal information still yields an improvement in R-2. However, as this affects only a few of the 44 document sets, statistical variance means that these R-2 scores are no longer significantly different from those produced by SWING.
We have shown in this work how temporal information in the form of timelines can be incorporated into multi-document summarization. We achieve this through two means, using: 1) three novel features derived from timelines to measure the saliency of sentences, and 2) TimeMMR which considers time span similarity to enhance the traditional MMR’s lexical diversity measure.
To overcome errors propagated from the underlying temporal processing systems, we proposed a reliability filtering metric which can be used to help decide when temporal information should be used for summarization. The use of this metric leads to an overall 5.9% gain in R-2 over the competitive SWING baseline.
In future work, we are keen to study our proposed timeline-related features more intrinsically in the context of human-generated summaries. This can help us better understand their value in improving content selection. As noted earlier, it will also be useful to repeat our experiments with less lexicon-influenced measures like the Pyramid method [20]. Manual assessment of the generated summaries can also be done to give a better picture of the quality of the summaries generated with the use of timelines. Finally, given the importance of reliability filtering, a natural question is whether there are other metrics that can be used to obtain better results.
This research is supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.
This work is also partially supported by the National Natural Science Foundation of China (Grant Nos. 61170189, 61370126, 61202239), the Fund of the State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2013ZX-19), and the Innovation Foundation of Beihang University for Ph.D. Graduates (YWF-13-T-YJSY-024).