This paper presents a method for detecting words related to a topic (we call them topic words) over time in the stream of documents. Topic words are widely distributed in the stream of documents, and sometimes they frequently appear in the documents, and sometimes not. We propose a method to reinforce topic words with low frequencies by collecting documents from the corpus, and applied Latent Dirichlet Allocation [4] to these documents. For the results of LDA, we identified topic words by using Moving Average Convergence Divergence. In order to evaluate the method, we applied the results of topic detection to extractive multi-document summarization. The results showed that the method was effective for sentence selection in summarization.
As the volume of online documents has drastically increased, the analysis of topic bursts, topic drift or detection of topic is a practical problem attracting more and more attention [1, 23, 2, 13, 15, 9]. The earliest known approach is the work of Klinkenberg and Joachims [12]. They have attempted to handle concept changes by focusing a window with documents sufficiently close to the target concept. Mane et. al. proposed a method to generate maps that support the identification of major research topics and trends [17]. The method used Kleinberg’s burst detection algorithm, co-occurrences of words, and graph layout technique. Scholz et. al. have attempted to use different ensembles obtained by training several data streams to detect concept drift [22]. However the ensemble method itself remains a problem that how to manage several classifiers effectively. He and Parket attempted to find bursts, periods of elevated occurrence of events as a dynamic phenomenon instead of focusing on arrival rates [11]. However, the fact that topics are widely distributed in the stream of documents, and sometimes they frequently appear in the documents, and sometimes not often hamper such attempts.
This paper proposes a method for detecting topic over time in series of documents. We reinforced words related to a topic with low frequencies by collecting documents from the corpus, and applied Latent Dirichlet Allocation (LDA) [4] to these documents in order to extract topic candidates. For the results of LDA, we applied Moving Average Convergence Divergence (MACD) to find topic words while He et. al., applied it to find bursts. The MACD is a technique to analyze stock market trends [19]. It shows the relationship between two moving averages of prices modeling bursts as intervals of topic dynamics, i.e., positive acceleration. Fukumoto et. al also applied MACD to find topics. However, they applied it only to the words with high frequencies in the documents [10]. In contrast, we applied it to the topic candidates obtained by LDA.
We examined our method by extrinsic evaluation, i.e., we applied the results of topic detection to extractive multi-document summarization. We assume that a salient sentence includes words related to the target topic, and an event of each documents. Here, an event is something that occurs at a specific place and time associated with some specific actions[1]. We identified event words by using the traditional tfidf method applied to the results of named entities. Each sentence in documents is represented using a vector of frequency weighted words that can be event or topic words. We used Markov Random Walk (MRW) to compute the rank scores for the sentences [20]. Finally, we selected a certain number of sentences according to the rank score into a summary.
LDA presented by [4] models each document as a mixture of topics (we call it lda_topic to discriminate our candidates), and generates a discrete probability distribution over words for each lda_topic. The generative process for LDA can be described as follows:
For each topic = 1, , , generate , multinomial distribution of words specific to the topic from a Dirichlet distribution with parameter ;
For each document = 1, , , generate , multinomial distribution of topics specific to the document from a Dirichlet distribution with parameter ;
For each word = 1, , in document ;
Generate a topic of the word in the document from the multinomial distribution
Generate a word , the word associated with the word in document from multinomial
Like much previous work on LDA, we used Gibbs sampling to estimate and . The sampling probability for topic in document is given by:
(1) |
refers to a topic set , not including the current assignment . is the count of word in topic that does not include the current assignment , and indicates a summation over that dimension. refers to a set of documents, and denotes the total number of unique topics. After a sufficient number of sampling iterations, the approximated posterior can be used to estimate and by examining the counts of word assignments to topics and topic occurrences in documents. The approximated probability of topic in the document , , and the assignments word to topic , are given by:
(2) | |||||
(3) |
We used documents prepared by summarization tasks, NTCIR and DUC data as each task consists of series of documents with the same topic. We applied LDA to the set consisting of all documents in the summarization tasks and documents from the corpus. We need to estimate the appropriate number of lda_topic.
file=cluster.eps,height=5cm,width=7.5cm
Let be the number of lda_topics and be the number of topmost documents assigned to each lda_topic. We note that the result obtained by LDA can be regarded as the two types of clustering result shown in Figure 1: (i) each cluster corresponds to each lda_topic (topic id0, topic id1 in Figure 1), and each element of the clusters is the document in the summarization tasks (task1, task2, in Figure 1) or from the corpus (doc in Figure 1), and (ii) each cluster corresponds to the summarization task and each element of the clusters is the document in the summarization tasks or the document from the corpus assigned topic id. For example, DUC2005 consists of 50 tasks. Therefore the number of different clusters is 50. We call the former lda_topic cluster and the latter task cluster. We estimated and by using Entropy measure given by:
(4) |
refers to the number of clusters. is a probability that the elements of the cluster assigned to the correct class . denotes the total number of elements and shows the total number of elements assigned to the cluster . The value of ranges from 0 to 1, and the smaller value of indicates better result. Let and are entropy value of lda_topic cluster and task cluster, respectively. We chose the parameters and whose value of the summation of and is smallest. For each lda_topic, we extracted words whose probabilities are larger than zero, and regarded these as topic candidates.
The proposed method does not simply use MACD to find bursts, but instead determines topic words in series of documents. Unlike Dynamic Topic Models [3], it does not assume Gaussian distribution so that it is a natural way to analyze bursts which depend on the data. We applied it to extract topic words in series of documents. MACD histogram defined by Eq. (6) shows a difference between the MACD and its moving average. MACD of a variable is defined by the difference of -day and -day moving averages, MACD(,) = EMA() - EMA(). Here, EMA() refers to -day Exponential Moving Average (EMA). For a variable = which has a corresponding discrete time series x = { = 0,1,}, the -day EMA is defined by Eq. (5).
(5) | |||||
refers to a smoothing factor and it is often taken to be . MACD histogram shows a difference between the MACD and its moving average11In the experiment, we set , , and to 4, 8 and 5, respectively [11]..
(6) | |||||
The procedure for topic detection with MACD is illustrated in Figure 2. Let be a series of documents and be one of the topic candidates obtained by LDA. Each document in is sorted in chronological order. We set to the documents from the summarization task. Whether or not a word is a topic word is judged as follows:
file=topic_detect.eps,height=5cm,width=7.5cm
Create document-based MACD histogram where X-axis refers to , i.e., a period of time (numbered from day 1 to 365). Y-axis is the document count in per day. Hereafter, referred to as correct histogram.
Create term-based MACD histogram where X-axis refers to , and Y-axis denotes bursts of word in . Hereafter, referred to as bursts histogram.
We assume that if a term is informative for summarizing a particular documents in a collection, its burstiness approximates the burstiness of documents in the collection. Because is a representative word of each document in the task. Based on this assumption, we computed similarity between correct and word histograms by using KL-distance22We tested KL-distance, histogram intersection and Bhattacharyya distance to obtain similarities. We reported only the result obtained by KL-distance as it was the best results among them.. Let and be a normalized distance of correct histogram, and bursts histogram, respectively. KL-distance is defined by = where refers bursts in time . If the value of is smaller than a certain threshold value, is regarded as a topic word.
An event word is something that occurs at a specific place and time associated with some specific actions [2, 1]. It refers to notions of who(person), where(place), when(time) including what, why and how in a document. Therefore, we can assume that named entities(NE) are linguistic features for event detection. An event word refers to the of the document itself, and frequently appears in the document but not frequently appear in other documents. Therefore, we first applied NE recognition to the target documents to be summarized, and then calculated tfidf to the results of NE recognition. We extracted words whose tfidf values are larger than a certain threshold value, and regarded these as event words.
We recall that our hypothesis about key sentences in multiple documents is that they include topic and event words. Each sentence in the documents is represented using a vector of frequency weighted words that can be event or topic words.
Like much previous work on extractive summarization [7, 18, 25], we used Markov Random Walk (MRW) model to compute the rank scores for the sentences. Given a set of documents to be summarized, = (, ) is a graph reflecting the relationships between two sentences. is a set of vertices, and each vertex in is a sentence. is a set of edges, and each edge in is associated with an affinity weight between sentences and ( ). The affinity weight is computed using cosine measure between the two sentences, and . Two vertices are connected if their affinity weight is larger than 0 and we let = 0 to avoid self transition. The transition probability from to is then defined as follows:
(7) |
We used the row-normalized matrix = to describe with each entry corresponding to the transition probability, where = . To make a stochastic matrix, the rows with all zero elements are replaced by a smoothing vector with all elements set to . The final transition matrix is given by formula (8), and each score of the sentence is obtained by the principal eigenvector of the matrix .
(8) |
We selected a certain number of sentences according to rank score into the summary.
We applied the results of topic detection to extractive multi-document summarization task, and examined how the results of topic detection affect the overall performance of the salient sentence selection. We used two tasks, Japanese and English summarization tasks, NTCIR-333http://research.nii.ac.jp/ntcir/ SUMM Japanese and DUC44http://duc.nist.gov/pubs.html English data. The baselines are (i) MRW model (MRW): The method applies the MRW model only to the sentences consisted of noun words, (ii) Event detection (Event): The method applies the MRW model to the result of event detection, (iii) Topic Detection by LDA (LDA): MRW is applied to the result of topic candidates detection by LDA and (iv) Topic Detection by LDA and MACD (LDA & MACD): MRW is applied to the result of topic detection by LDA and MACD only, i.e., the method does not include event detection.
The data used in the NTCIR-3 multi-document summarization task is selected from 1998 to 1999 of Mainichi Japanese Newspaper documents. The gold standard data provided to human judges consists of FBFREE DryRun and FormalRun. Each data consists of 30 tasks. There are two types of correct summary according to the character length, “long” and “short”, All series of documents were tagged by CaboCha [14]. We used person name, organization, place and proper name extracted from NE recognition [14] for event detection, and noun words including named entities for topic detection. FBFREE DryRun data is used to tuning parameters, i.e., the number of extracted words according to the tfidf value, and the threshold value of KL-distance. The size that optimized the average Rouge-1(R-1) score across 30 tasks was chosen. As a result, we set tfidf and KL-distance to 100 and 0.104, respectively.
We used FormalRun as a test data, and another set consisted of 218,724 documents from 1998 to 1999 of Mainichi newspaper as a corpus used in LDA and MACD. We estimated the number of and in LDA, i.e., we searched and in steps of 100 from 200 to 900. Figure 3 illustrates entropy value against the number of topics and documents using 30 tasks of FormalRun data. Each plot shows that at least one of the documents for each summarization task is included in the cluster. We can see from Figure 3 that the value of entropy depends on the number of documents rather than the number of topics. From the result shown in Figure 3, the minimum entropy value was 0.025 and the number of topics and documents were 400 and 300, respectively. We used them in the experiment. The summarization results are shown in Table 1.
file=entropy.eps,height=4.5cm,width=7.5cm
Method | Short | Long |
---|---|---|
R-1 | R-1 | |
MRW | .369 | .454 |
Event | .625 | .724 |
LDA | .525 | .712 |
LDA & MACD | .630 | .742 |
Event & Topic | .678 | .744 |
Table 1 shows that our approach, “Event & Topic” outperforms other baselines, regardless of the summary type (long/short). Topic candidates include surplus words that are not related to the topic because the results obtained by “LDA” were worse than those obtained by “LDA & MACD”, and even worse than “Event” in both short and long summary. This shows that integration of LDA and MACD is effective for topic detection.
We used DUC2005 consisted of 50 tasks for training, and 50 tasks of DUC2006 data for testing in order to estimate parameters. We set tfidf and KL-distance to 80 and 0.9. The minimum entropy value was 0.050 and the number of topics and documents were 500 and 600, respectively. 45 tasks from DUC2007 were used to evaluate the performance of the method. All documents were tagged by Tree Tagger [21] and Stanford Named Entity Tagger 55http://nlp.stanford.edu/software/CRF-NER.shtml [8]. We used person name, organization and location for event detection, and noun words including named entities for topic detection. AQUAINT corpus66http://catalog.ldc.upenn.edu/LDC2002T31 which consists of 1,033,461 documents are used as a corpus in LDA and MACD. Table 2 shows Rouge-1 against unigrams.
Method | R-1 | Method | R-1 |
---|---|---|---|
MRW | .381 | Event | .407 |
LDA | .402 | LDA & MACD | .428 |
Event & Topic | .438 | ||
PYTHY | .426 | HybHSum | .456 |
hPAM | .412 | TTM | .447 |
We can see from Table 2 that Rouge-1 obtained by our approach was also the best compared to the baselines. Table 2 also shows the performance of other research sites reported by [5]. The top site was “HybHSum” by [5]. However, the method is a semi-supervised technique that needs a tagged training data. Our approach achieves performance approaching the top-performing unsupervised method, “TTM” [6], and is competitive to “PYTHY” [24] and “hPAM” [16]. Prior work including “TTM” has demonstrated the usefulness of semantic concepts for extracting salient sentences. For future work, we should be able to obtain further advantages in efficacy in our topic detection and summarization approach by disambiguating topic senses.
The research described in this paper explores a method for detecting topic words over time in series of documents. The results of extrinsic evaluation showed that integration of LDA and MACD is effective for topic detection.