We propose a novel abstractive query-based summarization system for conversations, where queries are defined as phrases reflecting a user's information needs. We rank and extract the utterances in a conversation based on the overall content and the phrasal query information. We cluster the selected sentences based on their lexical similarity and aggregate the sentences in each cluster by means of a word graph model. We propose a ranking strategy to select the best path in the constructed graph as a query-based abstract sentence for each cluster. The resulting summary consists of abstractive sentences representing the phrasal query information and the overall content of the conversation. Automatic and manual evaluation results over meeting, chat and email conversations show that our approach significantly outperforms baselines and previous extractive models.
Our lives are increasingly reliant on multimodal conversations with others. We email for business and personal purposes, attend meetings in person, chat online, and participate in blog or forum discussions. While this growing amount of personal and public conversation represents a valuable source of information, going through such an overwhelming amount of data to satisfy a particular information need often leads to an information overload problem [14]. Automatic summarization has been proposed in the past as a way to address this problem (e.g., [25]). However, a good summary often cannot be generic; it should be a brief and well-organized paragraph that answers a user's information need.
The Document Understanding Conference (DUC, http://www-nlpir.nist.gov/projects/duc/index.html) has run query-focused multi-document summarization as its main task since 2004, focusing on complex queries with very specific answers. For example, "How were the bombings of the US embassies in Kenya and Tanzania conducted? How and where were the attacks planned?". Such complex queries are appropriate for a user who has specific information needs and can formulate the questions precisely. However, especially when dealing with conversational data that tend to be less structured and less topically focused, a user is often initially only exploring the source documents, with less specific information needs. Moreover, following the common practice in search engines, users are trained to form simpler and shorter queries [21]. For example, when a user is interested in certain characteristics of an entity in online reviews (e.g., "location" or "screen") or a specific entity in a blog discussion (e.g., "new model of iphone"), she would not initially compose a complex query.
To address these issues, in this work, we tackle the task of conversation summarization based on phrasal queries. We define a phrasal query as a concatenation of two or more keywords, which is a more realistic representation of a user’s information needs. For conversational data, this definition is more similar to the concept of search queries in information retrieval systems as well as to the concept of topic labels in the task of topic modeling. Example 1 shows two queries and their associated human written summaries based on a single chat log. We can observe that the two summaries, although generated from the same chat log, are totally distinct. This further demonstrates the importance of phrasal query-based summarization systems for long conversations.
To date, most systems in the area of summarization focus on news or other well-written documents, while research on summarizing multiparty written conversations (e.g., chats, emails) has been limited. This is because traditional NLP approaches developed for formal texts are often not satisfactory when dealing with multiparty written conversations, which are typically in a casual style and do not display a clear syntactic structure with proper grammar and spelling. Even though some works try to address the problem of summarizing multiparty written conversations (e.g., [20, 29, 23, 32, 9]), they do so in a generic way (not query-based) and focus on only one conversational domain (e.g., meetings). Moreover, most of the proposed systems for conversation summarization are extractive.
To address such limitations, we propose a fully automatic unsupervised abstract generation framework based on phrasal queries for multimodal conversation summarization. Our key contributions in this work are as follows:
Example 1. Two queries and their associated human-written summaries for the same chat log.

Query-1: Test/Sample database for GNUe

Abstract-1: James Thompson asked Reinhard: I was going to work on the sample tonight. You mentioned wanting a fishhook and all data types. Any other things you want to see in there? Reinhard said that master/detail would be good, as there have been bugs only appearing in 3-level case. James said he already included that and I know I need to add a boolean. Did you want date as well as date-time? Reinhard said yes - we also have time values (time without date). They are especially interesting. James had not ever had use for something like that so I’m not sure where I would graft that in.

Query-2: Passing parameters to Forms

Abstract-2: James Thompson (jamest) asked how did parameter support in forms change recently? He reported the trigger namespace function referencesGFForm.parameters - which no longer exists. Reinhard said every GFForm should have a parameters. James said he was using parameters in on-startup. Reinhard said that’s probably the only place where they don’t work. James said that I’m thinking about moving that to on-activation instead of on-startup anyway as it should still work for a main form - but i still wonder if the on-startup parameter issue should be considered a bug - as it shouldn’t choke. Reinhard was sure it should be considered a bug but I have no idea how to fix it. We haven’t found a way to deal with parameters that works for every case. I don’t know if there is any chance to pass the parameters to the form before it is activated. James asked how are parameters handled now? Reinhard replied that they are passed to activateForm so they are available from activation for the –main– form, the command line parameters are passed and for dialogs, the parameters are passed that were given in runDialog.
1) To the best of our knowledge, our framework is the first abstractive system that generates summaries based on users' phrasal queries, instead of well-formed questions. As a by-product of our approach, we also propose an extractive summarization model based on phrasal queries to select the summary-worthy sentences in the conversation based on query terms and signature terms [17].
2) We propose a novel ranking strategy to select the best path in the constructed word graph by taking the query content, overall information content and grammaticality (i.e., fluency) of the sentence into consideration.
3) Although most of the current summarization approaches use supervised algorithms as a part of their system (e.g., [30]), our method can be totally unsupervised and does not depend on human annotation.
4) Although different conversational modalities (e.g., email vs. chat vs. meeting) exhibit domain-specific characteristics, in this work we take advantage of their underlying similarities to generalize away from specific modalities and develop an effective method for query-based summarization of multimodal conversations.
We evaluate our system over the GNUe Traffic archive (http://kt.earth.li/GNUe/index.html) of Internet Relay Chat (IRC) logs, the AMI meeting corpus [4] and the BC3 email dataset [26]. Automatic evaluation on the chat dataset and manual evaluation over the meetings and emails show that our system uniformly and statistically significantly outperforms baseline systems, as well as a state-of-the-art query-based extractive summarization system.
Figure 1: The three steps of our phrasal query abstraction framework.
Our phrasal query abstraction framework generates a grammatical abstract from a conversation following three steps, as shown in Figure 1.
Abstractive summary sentences can be created by aggregating and merging multiple sentences into an abstract sentence. In order to generate such a sentence, we need to identify which sentences from the original document should be extracted and combined to generate abstract sentences. In other words, we want to identify the summary-worthy sentences in the text that can be combined into an abstract sentence. This task can be considered as content selection. Moreover, this step, on its own, corresponds to an extractive summarization system.
In order to select and extract the informative summary-worthy utterances, based on the phrasal query and the original text, we consider two criteria: i) utterances should carry the essence of the original text; and ii) utterances should be relevant to the query. To fulfill such requirements we define the concepts of signature terms and query terms.
Signature terms are generally indicative of the content of a document or collection of documents. To identify such terms, we can use frequency, word probability, standard statistic tests, information-theoretic measures or log-likelihood ratio. In this work, we use log-likelihood ratio to extract the signature terms from chat logs, since log-likelihood ratio leads to better results [12]. We use a method described in [17] in order to identify such terms and their associated weight. Example 2 demonstrates a chat log and associated signature terms.
Example 2. A chat log and its associated signature terms.

Signature terms: navigator, functionality, reports, UI, schema, gnu

Chat log:
- but watching them build a UI in the flash demo’s is pretty damn impressive… and have started moving my sales app to all UI being built via …
- i’ll be expanding the technotes in navigator for a while …
- … in terms of functionality of the underlying databases …
- you mean if I start GNU again I have to read bug reports too?
- no, just in case you want to enter bug report
- …I expand the schema before populating with test data …
- i’m willing to scrap it if there is a better schema hidden in gnue somewhere :)
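To make the signature-term step concrete, a minimal sketch of the log-likelihood-ratio test in the spirit of [17] is given below. This is our own illustration rather than the authors' implementation; the cut-off value and the normalization of weights into [0, 1] are assumptions.

```python
import math
from collections import Counter

def _binom_loglik(p, k, n):
    """Binomial log-likelihood log L(p; k, n), clamping p away from 0 and 1."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def signature_terms(conversation_tokens, background_tokens, cutoff=10.0):
    """Terms whose -2 log(lambda) score exceeds `cutoff` (illustrative default),
    mapped to a weight normalized into [0, 1]."""
    fg, bg = Counter(conversation_tokens), Counter(background_tokens)
    n1, n2 = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, k1 in fg.items():
        k2 = bg.get(term, 0)
        p = (k1 + k2) / (n1 + n2)        # null hypothesis: one rate for both corpora
        p1 = k1 / n1
        p2 = k2 / n2 if n2 else 0.0
        llr = 2.0 * (_binom_loglik(p1, k1, n1) + _binom_loglik(p2, k2, n2)
                     - _binom_loglik(p, k1, n1) - _binom_loglik(p, k2, n2))
        if llr > cutoff:
            scores[term] = llr
    top = max(scores.values(), default=0.0)
    return {t: s / top for t, s in scores.items()} if top else {}
```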
Query terms are indicative of the content in a phrasal query. In order to identify such terms, we first extract all content terms from the query. Then, following previous studies (e.g., [10]), we use the synset relations in WordNet for query expansion. We extract all concepts that are synonyms of the query terms and add them to the original set of query terms. Note that we limit our synsets to the nouns, since verb synonyms do not prove to be effective in query expansion [13]. While signature terms are weighted, we assume that all query terms are equally important and all have a weight equal to 1.
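A minimal sketch of query-term extraction and WordNet noun-synset expansion follows, assuming NLTK's WordNet interface (the paper does not name a toolkit) and a simple stop-word filter in place of whatever content-word selection was actually used.

```python
# requires: nltk.download('punkt'), nltk.download('wordnet'), nltk.download('stopwords')
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

def query_terms(phrasal_query):
    """Content terms of the query plus their WordNet noun synonyms,
    all weighted equally (weight = 1)."""
    stops = set(stopwords.words("english"))
    terms = {w.lower() for w in nltk.word_tokenize(phrasal_query)
             if w.isalpha() and w.lower() not in stops}
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term, pos=wn.NOUN):   # nouns only, as in the paper
            for lemma in synset.lemma_names():
                expanded.add(lemma.lower().replace("_", " "))
    return {t: 1.0 for t in expanded}                  # all query terms share weight 1
```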
To estimate the utterance score, we view both the query terms and the signature terms as the terms that should appear in a human query-based summary. To achieve this, the most relevant (summary-worthy) utterances that we select are the ones that maximize the coverage of such terms. Given the query terms and signature terms, we can estimate the utterance score as follows:
$$\mathrm{Query}(u) = \frac{\sum_{i=1}^{N} \delta(t_i)}{N} \quad (1)$$

$$\mathrm{Signature}(u) = \frac{\sum_{i=1}^{N} \theta(t_i)\, w(t_i)}{N} \quad (2)$$

$$\mathrm{Score}(u) = \alpha \cdot \mathrm{Query}(u) + \beta \cdot \mathrm{Signature}(u) \quad (3)$$

where $N$ is the number of content words in the utterance $u$, $\delta(t_i)=1$ if the term $t_i$ is a query term and $0$ otherwise, $\theta(t_i)=1$ if $t_i$ is a signature term and $0$ otherwise, and $w(t_i)$ is the normalized associated weight for signature terms. The parameters $\alpha$ and $\beta$ are tuned on a development set and sum up to $1$.
After all the utterances are scored, the top scored utterances are selected to be sent to the next step. We estimate the percentage of the retrieved utterances based on the development set.
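Assuming the reconstruction of Equations 1-3 above, utterance scoring and selection reduce to a few lines; the default coefficients and the retained fraction below are placeholders to be tuned on the development set.

```python
def utterance_score(content_words, query_terms, signature_terms, alpha=0.5, beta=0.5):
    """Score an utterance by its coverage of query terms and weighted signature terms
    (Equations 1-3); alpha + beta = 1, both tuned on the development set."""
    n = len(content_words) or 1
    query_cov = sum(1.0 for t in content_words if t in query_terms) / n
    sig_cov = sum(signature_terms.get(t, 0.0) for t in content_words) / n
    return alpha * query_cov + beta * sig_cov

def select_utterances(utterances, query_terms, signature_terms, keep_ratio=0.3):
    """Keep the top-scoring fraction of utterances (the ratio is a placeholder,
    estimated on the development set in the paper)."""
    ranked = sorted(utterances,
                    key=lambda u: utterance_score(u, query_terms, signature_terms),
                    reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```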
Utterances selected in the previous step often include redundant information, which is semantically equivalent but may vary in lexical choices. By identifying the semantic relations between the sentences, we can discover what information in one sentence is semantically equivalent, novel, or more/less informative with respect to the content of the other sentences. Similar to earlier work [3, 1], we cast this problem as a variant of the Textual Entailment (TE) recognition task [5]. Using entailment in this phase is motivated by taking advantage of semantic relations instead of purely statistical methods (e.g., Maximal Marginal Relevance) and has been shown to be more effective [19]. We follow the same practice as [19] to build an entailment graph over all selected sentences to identify relevant sentences and eliminate the redundant (in terms of meaning) and less informative ones.
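The entailment-based filtering can be sketched schematically as follows, with a hypothetical entails(premise, hypothesis) predicate standing in for the textual entailment recognizer of [19]; the actual entailment-graph construction in that work may differ in detail.

```python
def remove_redundant(sentences, entails):
    """Keep a sentence unless some already-kept sentence entails it (i.e. already
    conveys its content); `entails(a, b)` is a hypothetical predicate returning
    True when sentence a entails sentence b."""
    kept = []
    for sent in sentences:
        if any(entails(other, sent) for other in kept):
            continue                      # redundant or less informative: drop it
        # drop previously kept sentences that this one entails (it is more informative)
        kept = [other for other in kept if not entails(sent, other)]
        kept.append(sent)
    return kept
```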
In this phase, our goal is to generate understandable, informative abstract sentences that capture the content of the source sentences and represent the information needs defined by the queries. There are several ways of generating abstract sentences (e.g., [2, 18, 8, 23]); however, most of them rely heavily on the sentence structure. We believe that such approaches are suboptimal, especially when dealing with conversational data, because multiparty written conversations are often poorly structured. Instead, we apply an approach that relies neither on syntax nor on a standard NLG architecture. Moreover, since efficiency is important when dealing with user queries, we also aim for an approach that produces the abstracts quickly. We perform the task of abstract generation in three steps, as follows:
In order to generate an abstract summary, we need to identify which sentences from the previous step (i.e., redundancy removal) can be clustered and combined into abstract sentences. This task can be viewed as sentence clustering, where each sentence cluster provides the content for one abstract sentence.
We use the K-means clustering algorithm with cosine similarity as the distance function between sentence vectors composed of tf.idf scores. Note also that the lexical similarity between the sentences in a cluster facilitates both the construction of the word graph and the search for the best path in it, as described next.
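A minimal clustering sketch with scikit-learn is shown below; on the L2-normalized tf.idf vectors produced by TfidfVectorizer, the Euclidean distance used internally by KMeans is monotonically related to cosine distance, so this approximates the cosine-based K-means described above. The number of clusters is a placeholder.

```python
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_sentences(sentences, k):
    """Group sentences into k clusters of lexically similar sentences."""
    vectors = TfidfVectorizer().fit_transform(sentences)   # rows are L2-normalized
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    clusters = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        clusters[label].append(sentence)
    return list(clusters.values())
```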
In order to construct a word graph, we adopt the method recently proposed by [19, 7] with some optimizations. Below, we show how the word graph is applied to generate the abstract sentences.
Let $G = (V, E)$ be a directed graph with a set of nodes $V$ representing words and a set of directed edges $E$ representing the links between words. Given a cluster of related sentences, a word graph is constructed by iteratively adding sentences to it. In the first step, the graph represents one sentence plus the start and end symbols. A node is added to the graph for each word in the sentence, and adjacent words are linked with directed edges. When adding a new sentence, a word from the sentence is merged into an existing node in the graph provided that they have the same POS tag and they satisfy one of the following conditions:
i) They have the same word form;
ii) They are connected in WordNet by the synonymy relation. In this case the lexical choice for the node is selected based on the tf.idf score of each node;
iii) They are from a hypernym/hyponym pair or share a common direct hypernym. In this case, both words are replaced by the hypernym;
iv) They are in an entailment relation. In this case, the entailing word is replaced by the entailed one.
The motivation behind merging non-identical words is to enrich the common terms between the phrases and thus increase the chance that they can be merged into a single phrase. This also helps to move beyond the limitations of the original lexical choices. If merging is not possible, a new node is created in the graph. When a node can be merged with multiple nodes (i.e., merging is ambiguous), either the context (the preceding and following words in the sentence and the neighboring nodes in the graph) or the frequency is used to select the candidate node.
We connect adjacent words with directed edges. For new or unconnected nodes, we draw an edge with a weight of 1. In contrast, when two already connected nodes are added (merged), the weight of their connection is increased by 1.
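A simplified construction sketch, using networkx for illustration, is given below; it covers only condition (i) above (merging nodes with identical word form and POS). The WordNet and entailment merges of conditions (ii)-(iv), as well as the ambiguity-resolution heuristics, would extend the node key and merging logic.

```python
import networkx as nx

START, END = ("<start>", "SYM"), ("<end>", "SYM")

def build_word_graph(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, POS) pairs.
    Nodes are (lowercased word, POS); identical nodes are merged automatically,
    and edge weights count how often two nodes appear adjacently."""
    graph = nx.DiGraph()
    for sentence in tagged_sentences:
        path = [START] + [(w.lower(), pos) for w, pos in sentence] + [END]
        for left, right in zip(path, path[1:]):
            if graph.has_edge(left, right):
                graph[left][right]["weight"] += 1   # merged adjacency: strengthen link
            else:
                graph.add_edge(left, right, weight=1)
    return graph
```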
A word graph, as described above, may contain many sequences connecting start and end. However, it is likely that most of the paths are not readable. We are aiming at generating an informative abstractive sentence for each cluster based on a user query. Moreover, the abstract sentence should be grammatically correct.
In order to satisfy both requirements, we have devised the following ranking strategy. First, we prune the paths in which a verb does not exist, to filter ungrammatical sentences. Then we rank other paths as follows:
Query focus: to identify the summary sentence with the highest coverage of query content, we propose a score that counts the number of query terms that appear in the path. In order to reward paths that cover more salient query terms, we also incorporate the tf.idf score of the query terms into the coverage formulation:

$$Q(P) = \sum_{q_i \in P} \mathrm{tf.idf}(q_i)$$

where the $q_i$ are the query terms.
Fluency: in order to improve the grammaticality of the generated sentences, we guide our ranking model to select more fluent (i.e., grammatically correct) paths in the graph. We estimate the grammaticality $F(P)$ of a generated path $P$ using a language model.
Path weight: the purpose of this function is two-fold: i) to generate a grammatical sentence by favoring the links between nodes (words) which appear often; and ii) to generate an informative sentence by increasing the weight of edges connecting salient nodes. For a path $P$ with $m$ nodes, we define the edge weight $w(e_{i,j})$ and the path weight $W(P)$ as below:

$$w(e_{i,j}) = \frac{\mathrm{freq}(i) + \mathrm{freq}(j)}{\sum_{P' \in G,\; i,j \in P'} \mathrm{diff}(P', i, j)^{-1}}$$

$$W(P) = \frac{\sum_{e_{i,j} \in P} w(e_{i,j})}{m}$$

where the function $\mathrm{diff}(P', i, j)$ refers to the distance between the offset positions $\mathrm{pos}(P', i)$ and $\mathrm{pos}(P', j)$ of nodes $i$ and $j$ in path $P'$ (any path in $G$ containing $i$ and $j$) and is defined as $\mathrm{diff}(P', i, j) = \mathrm{pos}(P', j) - \mathrm{pos}(P', i)$.
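A direct transcription of the edge- and path-weight formulas as reconstructed above follows; the node frequencies and the cluster sentences (as node sequences) are assumed to be available from the graph construction step, and the formulas themselves are our reconstruction of the adopted word-graph weighting [19, 7].

```python
def edge_weight(i, j, freq, sentence_paths):
    """w(e_ij) as reconstructed above: node frequencies over the summed inverse
    distances of i and j in the sentences (paths) containing both."""
    inv_diff = 0.0
    for path in sentence_paths:
        if i in path and j in path:
            pos_i, pos_j = path.index(i), path.index(j)
            if pos_i < pos_j:                  # diff(P, i, j) = pos(P, j) - pos(P, i)
                inv_diff += 1.0 / (pos_j - pos_i)
    return (freq[i] + freq[j]) / inv_diff if inv_diff else 0.0

def path_weight(path, freq, sentence_paths):
    """W(P): summed edge weights of the candidate path, normalized by its m nodes."""
    total = sum(edge_weight(a, b, freq, sentence_paths) for a, b in zip(path, path[1:]))
    return total / len(path)
```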
Overall ranking score: in order to generate a query-based abstract sentence that combines the scores above, we employ a ranking model. The purpose of such a model is three-fold: i) to cover the content of the query information optimally; ii) to generate a more readable and grammatical sentence; and iii) to favor strong connections between the concepts. Therefore, the final ranking score of a path $P$ is calculated over the normalized scores as:

$$\mathrm{Score}(P) = \lambda_1\, Q(P) + \lambda_2\, F(P) + \lambda_3\, W(P)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the coefficient factors used to tune the ranking score, and they sum up to $1$. In order to rank the graph paths, we select all the paths that contain at least one verb and rerank them using our proposed ranking function to find the best path as the summary of the original sentences in each cluster.
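Putting the three components together, the re-ranking of verb-containing paths can be sketched as below; lm_logprob is a hypothetical helper returning the language-model log-probability of a word sequence, path_w wraps the path-weight computation above, and the coefficients default to the untuned uniform setting.

```python
def minmax(xs):
    """Min-max normalize raw scores into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [0.5] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def best_path(paths, query_tfidf, lm_logprob, path_w, lambdas=(1/3, 1/3, 1/3)):
    """Select the highest-scoring path; `paths` are candidate word sequences that
    already contain at least one verb (ungrammatical, verb-less paths are pruned
    beforehand). `query_tfidf` maps query terms to their tf.idf scores."""
    q = [sum(query_tfidf.get(w, 0.0) for w in set(p)) for p in paths]  # query focus Q(P)
    f = [lm_logprob(p) / len(p) for p in paths]                        # fluency F(P)
    w = [path_w(p) for p in paths]                                     # path weight W(P)
    l1, l2, l3 = lambdas
    scores = [l1 * a + l2 * b + l3 * c
              for a, b, c in zip(minmax(q), minmax(f), minmax(w))]
    best = max(range(len(paths)), key=lambda k: scores[k])
    return paths[best]
```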
In this section, we show the evaluation results of our proposed framework and its comparison to the baselines and a state-of-the-art query-focused extractive summarization system.
One of the challenges of this work is to find suitable conversational datasets that can be used for evaluating our query-based summarization system. Most available conversational corpora do not contain any human-written summaries, or the gold-standard human-written summaries are generic [4, 16]. In this work, we use available corpora of emails and chats for written conversations, while for spoken conversations we employ an available corpus of multiparty meeting conversations.
Chat: to the best of our knowledge, the only publicly available chat logs with human-written summaries can be downloaded from the GNUe Traffic archive [32, 27, 28]. Each chat log has a human-created summary in the form of a digest. Each digest summarizes the IRC logs for a period and consists of a few summaries over each chat log, each with a unique title for the associated human-written summary. In this way, the title of each summary can be treated as a phrasal query, and the corresponding summary is considered the query-based abstract of the associated chat log, including only the information most relevant to the title. Therefore, we can use the human-written query-based abstracts as gold standards and evaluate our system automatically. Our chat dataset consists of 66 query-based (title-based) human-written summaries with their associated queries (titles) and chat logs, created from 40 original chat logs. The average number of tokens is 1840, 325 and 6 for chat logs, query-based summaries and queries, respectively.
Meeting: we use the AMI meeting corpus [4], which consists of 140 multiparty meetings with a wide range of annotations, including generic abstractive summaries for each meeting. In order to create queries, we extract three key-phrases from the generic abstractive summaries using the TextRank algorithm [22]. We use the extracted key-phrases as queries to generate query-based abstracts. Since there is no human-written query-based summary for the AMI corpus, we randomly select 10 meetings and evaluate our system manually.
Email: we use BC3 [26], which contains 40 threads from the W3C corpus. The BC3 corpus is annotated with generic human-written abstractive summaries and has been used in several previous works (e.g., [15]). In order to adapt this corpus to our framework, we followed the same query generation process as for the meeting dataset. Finally, we randomly select 10 email threads and evaluate the results manually.
We compare our approach with the following baselines:
1) Cosine-1st: we rank the utterances in the chat log based on the cosine similarity between the utterance and the query. Then, we select the first utterance as the summary;
2) Cosine-all: we rank the utterances in the chat log based on the cosine similarity between the utterance and the query, and then select all utterances whose cosine similarity exceeds a fixed threshold;
3) TextRank: a widely used graph-based ranking model for single-document sentence extraction that builds a graph of all sentences in a document and uses inter-sentence similarity as edge weights to compute the salience of each sentence [22];
4) LexRank: another popular graph-based content selection algorithm for multi-document summarization [6];
5) Biased LexRank: a state-of-the-art query-focused summarization system that uses the LexRank algorithm to recursively retrieve additional passages that are similar to the query, as well as to the other nodes in the graph [24].
| Models | ROUGE-1 Prc | ROUGE-1 Rec | ROUGE-1 F-1 | ROUGE-2 Prc | ROUGE-2 Rec | ROUGE-2 F-1 |
|---|---|---|---|---|---|---|
| Cosine-1st | 71 | 5 | 8 | 30 | 3 | 5 |
| Cosine-all | 30 | 68 | 38 | 18 | 40 | 22 |
| TextRank | 25 | 76 | 34 | 15 | 44 | 20 |
| LexRank | 36 | 50 | 37 | 14 | 20 | 15 |
| Biased LexRank | 36 | 51 | 38 | 15 | 21 | 16 |
| Utterance extraction (our extractive system) | 34 | 66 | 40 | 20 | 40 | 24 |
| Utterance extraction (our pipeline extractive system) | 30 | 73 | 38 | 19 | 44 | 24 |
| Our abstractive system (without tuning) | 38 | 59 | 41 | 18 | 27 | 19 |
| Our abstractive system (with tuning) | 40 | 56 | 42 | 20 | 25 | 22 |
Moreover, we compare our abstractive system with the first part of our framework (utterance extraction in Figure 1), which can be presented as an extractive query-based summarization system (our extractive system). We also show the results of the version we use in our pipeline (our pipeline extractive system). The only difference between the two versions is the length of the generated summaries. In our pipeline we aim at higher recall, since we later filter sentences and aggregate them to generate new abstract sentences. In contrast, in the stand-alone version (extractive system) we limit the number of retrieved sentences to the desired length of the summary. We also compare the results of our full system (i.e., with tuning) with a non-optimized version in which the ranking coefficients are distributed equally. For parameter estimation, we tune all parameters (utterance selection and path ranking) exhaustively with 0.1 intervals using our development set.
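The exhaustive tuning over a 0.1-step grid, with coefficients constrained to sum to 1, can be sketched as follows; the evaluate callback (e.g., ROUGE-1 F-1 on the development set) is a hypothetical stand-in for whatever objective was actually used.

```python
from itertools import product

def grid_search(evaluate, step=0.1):
    """Try every (l1, l2, l3) on a 0.1 grid with l1 + l2 + l3 = 1 and return the
    best setting according to `evaluate` (a hypothetical dev-set objective)."""
    best_score, best_coeffs = float("-inf"), None
    steps = [round(i * step, 1) for i in range(int(1 / step) + 1)]
    for l1, l2 in product(steps, steps):
        l3 = round(1.0 - l1 - l2, 1)
        if l3 < 0:
            continue                      # coefficients must sum to 1
        score = evaluate((l1, l2, l3))
        if score > best_score:
            best_score, best_coeffs = score, (l1, l2, l3)
    return best_coeffs, best_score
```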
For manual evaluation of the query-based abstracts (meeting and email datasets), we perform a simple user study assessing the following aspects: i) Overall quality: the overall quality of the generated summary given the query (5-point scale); and ii) Responsiveness: how responsive the generated summary is to the query (5-point scale). Each query-based abstract was rated by two annotators (native English speakers). Evaluators are presented with the original conversation, the query and the generated summary. For the manual evaluation, we only compare our full system with LexRank (LR) and Biased LexRank (Biased LR). We also ask the evaluators to select the best summary for each query and conversation, given our system-generated summary and the two baselines.
To evaluate the grammaticality of our generated summaries, following common practice [2], we randomly selected 50 sentences from original conversations and system generated abstracts, for each dataset. Then, we asked annotators to give one of three possible ratings for each sentence based on grammaticality: perfect (2 pts), only one mistake (1 pt) and not acceptable (0 pts), ignoring capitalization or punctuation. Each sentence was rated by two annotators. Note that each sentence was evaluated individually, so the human judges were not affected by intra-sentential problems posed by coreference and topic shifts.
For preprocessing our dataset we use OpenNLP (http://opennlp.apache.org/) for tokenization, stemming and part-of-speech tagging. We use six randomly selected query-logs from our chat dataset (about 10% of the dataset) for tuning the coefficient parameters. We set the number of clusters in our clustering phase based on the average number of sentences in the human-written summaries. For our language model, we use a tri-gram smoothed language model trained on the newswire text provided in the English Gigaword corpus [11]. For the automatic evaluation we use the official ROUGE software with standard options and report ROUGE-1 and ROUGE-2 precision, recall and F-1 scores.
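For reproducibility, an approximate ROUGE-1/ROUGE-2 check can be run with the community rouge-score package; note that the paper itself used the official ROUGE toolkit, so the scores may differ slightly.

```python
from rouge_score import rouge_scorer

def rouge_eval(system_summary, reference_summary):
    """Return ROUGE-1 and ROUGE-2 precision/recall/F-1 for one summary pair."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
    results = scorer.score(reference_summary, system_summary)  # (target, prediction)
    return {name: (s.precision, s.recall, s.fmeasure) for name, s in results.items()}
```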
Query: Trigger namespace and the self property

Chat log:
A: good morning
B: good morning
C: good morning everyone
D: good morning
D: good night all
F: New GNUe Traffic online
F: loadsa deep metaphyisical stuff this week
F: D & E discuss the meaning of ’self’ ;-)
E: yes, and he took the more metaphysical route, where I took the more scientific route
E: I say self’s meaning is derived from one’s ancestry
E: self’s meaning is derived from how others use you
E: okay, analogy extended too far, I guess :)
F: is this a friends vs family debate?
E: also noted that the cool part about that is if you have code that needs to happen both on a pre-insert and a pre-update - but only a few lines of the code is different
E: you could have one block of trigger code that used self.action to find out why it had been called and branch accordingly.
E: there was a big jump from the previous paragraph to that
E: that took that out of context
E: iirc, I was saying an alternative was that ”self” could refer to neither the trigger’s owner nor to the trigger’s caller
E: but to the event itself
E: so self.action could be what action is being performed
E: self.parent could be the trigger’s owner
E: and self.context could be the caller
E: and self.name could be the name of the trigger
F: umm, I’m afraid apparant non-sequiturs are always a hazard of doing summaries ;-)
E: iow, I was presenting a 3rd alternative where self referred to something specific to the trigger
F: I normally try to write around them, but not 100% successful
E: I’m just convulsing my thoughts to the irc log
E: for an errata next week :)
E: I think convulsing is a good description
F: heh

TextRank: There was a big jump from the previous paragraph to that. that took that out of context iirc, I was saying an alternative was that ”self” could refer to neither the trigger’s owner nor to the trigger’s caller. but to the event itself. so self.action could be what action is being performed, self.parent could be the trigger’s owner, and self.context could be the caller and self.name could be the name of the trigger. umm, I’m afraid apparant non-sequiturs are always a hazard of doing summaries ;-)

LexRank: good morning everyone. heh. I’m just convulsing my thoughts to the irc log good morning. Jason also noted that the cool part about that is if you have code that needs to happen both on a pre-insert and a pre-update - but only a few lines of the code is different - you could have one block of trigger code that used self.action to find out why it had been called and branch accordingly. for an errata next week :) self’s meaning is derived from how others use you. I think convulsing is a good description reinhard & jcater discuss the meaning of ’self’ ;-)

Biased-LexRank: good morning everyone. heh. I’m just convulsing my thoughts to the irc log. Jason also noted that the cool part about that is if you have code that needs to happen both on a pre-insert and a pre-update - but only a few lines of the code is different - you could have one block of trigger code that used self.action to find out why it had been called and branch accordingly. yes, and he took the more metaphysical route, where I took the more scientific route there was a big jump from the previous paragraph to that but to the event itself. iow, I was presenting a 3rd alternative where self referred to something specific to the trigger.

Our system: self could refer to neither the triggers owner nor caller.
I was saying an alternative where self referred to something specific to the trigger. and self.name could be the name.
so self.action could be what action is being performed, self.parent the triggers owner and self.context caller.

Gold: Further to, E clarified that he had suggested that ”self” could refer to neither the trigger’s owner nor to the trigger’s caller - but to the event itself. So self.action could be what action is being performed, self.parent could be the trigger’s owner, and self.context could be the caller. In other words, I was presenting a 3rd alternative where self referred to something specific to the trigger.
Example 3. Summaries generated by our system and the baselines, in comparison with the human-written summary for a short chat log. Speaker information has been anonymized.
Abstractive vs. Extractive: our full query-based abstractive summarization system shows statistically significant improvements over the baselines and other purely extractive summarization systems for ROUGE-1 (statistical significance was calculated by approximate randomization, as described in [31]). This means our system can effectively aggregate the extracted sentences and generate abstract sentences based on the query content. We can also observe that our full system produces the highest ROUGE-1 precision score among all models, which further confirms the success of this model in meeting the user information needs imposed by queries. The absolute improvement of 10% in ROUGE-1 precision of our abstractive model over our extractive model (our pipeline) further confirms the effectiveness of our ranking method in generating abstract sentences that take the query-related information into account.
Table 2: Overall quality, responsiveness and user preference scores for the abstracts generated by our system and the two baselines.

| Dataset | Overall Quality (Our Sys / Biased LR / LR) | Responsiveness (Our Sys / Biased LR / LR) | Preference (Our Sys / Biased LR / LR) |
|---|---|---|---|
| Meeting | 2.9 / 2.5 / 2.1 | 3.8 / 3.2 / 1.8 | 70% / 30% / 0% |
| Email | 2.7 / 1.8 / 1.7 | 3.7 / 3.0 / 1.5 | 60% / 30% / 10% |
Table 3: Grammaticality scores and their distribution over the three possible ratings (G=2, G=1, G=0) for the original sentences (Orig) and our system-generated sentences (Sys).

| Dataset | Grammar (Orig / Sys) | G=2 (Orig / Sys) | G=1 (Orig / Sys) | G=0 (Orig / Sys) |
|---|---|---|---|---|
| Chat | 1.8 / 1.6 | 84% / 73% | 16% / 24% | 0% / 3% |
| Meeting | 1.5 / 1.3 | 50% / 40% | 50% / 55% | 0% / 5% |
| Email | 1.9 / 1.6 | 85% / 60% | 15% / 35% | 0% / 5% |
Our extractive query-based method beats all other extractive systems, with higher ROUGE-1 and ROUGE-2 scores, which shows the effectiveness of our utterance extraction model in comparison with other extractive models. In other words, the extractive model described in Section 2.1 is, as a stand-alone system, an effective query-based extractive summarization model. We also observe that our extractive model outperforms our abstractive model on the ROUGE-2 score. This can be due to word merging and word replacement choices in the word graph construction, which sometimes change or remove a word in a bigram and consequently may decrease the bigram overlap score.
Query Relevance: another interesting observation is that relying only on cosine similarity (i.e., Cosine-all) already provides a quite strong baseline. This confirms the importance of the query content in our dataset and further supports the main claim of our work that a good summary should be a brief and well-organized abstract that answers the user's query. Moreover, the 71% ROUGE-1 precision of the simple Cosine-1st baseline confirms that certain utterances in conversational discussions are highly concentrated in query-relevant information.
Query-based vs. Generic: the high recall and low precision of the TextRank baseline, for both the ROUGE-1 and ROUGE-2 scores, show the strength of the model in extracting the generic information from chat conversations while missing the query-relevant content. The LexRank baseline improves over TextRank by increasing the precision and balancing the precision and recall for ROUGE-1. We believe that this is due to the robustness of the LexRank method in dealing with noisy texts (chat conversations) [6]. In addition, the Biased LexRank model slightly improves over the generic LexRank system. Considering this marginal improvement and the relatively high results of the purely extractive systems, we can infer that the Biased LexRank extracted summaries do not carry much query-relevant content. In contrast, the significant improvement of our model over the extractive methods demonstrates the success of our approach in presenting the query-related content in the generated abstracts.
An example of a short chat log, its related query and corresponding manual and automatic summaries are shown in Example 3.
Content and User Preference: Table 2 shows the overall quality, responsiveness (query relatedness) and user preference scores for the abstracts generated by our system and the two baselines. Results indicate that our system significantly outperforms the baselines in overall quality and responsiveness, for both the meeting and email datasets. This confirms the validity of the results we obtained by conducting automatic evaluation over the chat dataset. We also observe that the absolute improvements in overall quality and responsiveness for emails (0.9 and 0.7) are greater than for meetings (0.4 and 0.6). This is expected, since dealing with spoken conversations is more challenging than dealing with written ones. Note that the responsiveness scores are greater than the overall quality scores; this further demonstrates the effectiveness of our approach in dealing with phrasal queries. We also evaluate the users' summary preferences. For both datasets (meeting and email), in the majority of cases (70% and 60%, respectively), the users prefer the query-based abstractive summary generated by our system.
Grammaticality: Table 3 shows the grammaticality scores and their distributions over the three possible ratings for all datasets. The chat dataset shows the highest scores: 73% of the sentences generated by our phrasal query abstraction model are grammatically correct and 24% are almost correct with only one grammatical error, while only 3% of the abstract sentences are grammatically incorrect. However, the results vary across datasets. For the meeting dataset, the percentage of completely grammatical sentences drops dramatically. This is due to the nature of spoken conversations, which are more error-prone and ungrammatical. The grammaticality score of the original sentences also shows that the sentences from meeting transcripts, although produced by humans, are not fully grammatical. In comparison with the original sentences, for all datasets, our model reports slightly lower grammaticality scores. Considering that the abstract sentences are automatically generated while the original sentences are human-written, the grammaticality score and the percentage of fully grammatical sentences generated by our system, together with its higher ROUGE and quality scores in comparison with other methods, demonstrate that our system is an effective phrasal query abstraction framework for both spoken and written conversations.
We have presented an unsupervised framework for abstractive summarization of spoken and written conversations based on phrasal queries. For content selection, we propose a sentence extraction model that incorporates query relevance and content importance into the extraction process. For the generation phase, we propose a ranking strategy that selects the best path in the constructed word graph based on fluency, query relevance and content. Both automatic and manual evaluation of our model show substantial improvements over extraction-based methods, including Biased LexRank, which is considered a state-of-the-art system. Moreover, our system also yields good grammaticality scores in human evaluation, comparable with those of the original sentences. Our future work is four-fold. First, we plan to improve our model by incorporating conversational features (e.g., speech acts). Second, we aim to implement a strategy for ordering the clusters in order to generate more coherent abstracts. Third, we intend to improve the generated summaries by resolving coreferences and incorporating speaker information (e.g., names) in the clustering and sentence generation phases. Finally, we plan to take advantage of topic shifts to better segment the parts of conversations relevant to the phrasal queries.
We would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the paper, and the NSERC Business Intelligence Network for financial support. We also would like to acknowledge the early discussions on the related topics with Frank Tompa.