This paper introduces an unsupervised graph-based method that selects textual labels for automatically generated topics. Our approach uses the topic keywords to query a search engine and generate a graph from the words contained in the results. PageRank is then used to weigh the words in the graph and score the candidate labels. The state-of-the-art method for this task is supervised [14]. Evaluation on a standard data set shows that the performance of our approach is consistently superior to previously reported methods.
Topic models [11, 2] have proved to be a useful way to represent the content of document collections, e.g. [4, 7, 8, 10, 20]. In these interfaces, topics need to be presented to users in an easily interpretable way. A common way to represent topics is as set of keywords generated from the terms with the highest marginal probabilities. For example, a topic about the global financial crisis could be represented by its top 10 most probable terms: financial, bank, market, government, mortgage, bailout, billion, street, wall, crisis. But interpreting such lists is not always straightforward, particularly since background knowledge may be required [5].
{`Description': `Microsoft will accelerate your journey to cloud computing with an agile and responsive datacenter built from your existing technology investments.',
`DisplayUrl': `www.microsoft.com/en-us/server-cloud/datacenter/virtualization.aspx',
`ID': `a42b0908-174e-4f25-b59c-70bdf394a9da',
`Title': `Microsoft | Server & Cloud | Datacenter | Virtualization ...',
`Url': `http://www.microsoft.com/en-us/server-cloud/datacenter/virtualization.aspx', ... }
Textual labels could assist with the interpretations of topics and researchers have developed methods to generate these automatically [17, 15, 14]. For example, a topic which has keywords school, student, university, college, teacher, class, education, learn, high, program, could be labelled as Education and a suitable label for the topic shown above would be Global Financial Crisis. Approaches that make use of alternative modalities, such as images [1], have also been proposed.
Mei et al. (2007) label topics using statistically significant bigrams identified in a reference collection. Magatti et al. (2009) introduced an approach for labelling topics that relied on two hierarchical knowledge resources labelled by humans, while Lau et al. (2010) proposed selecting the most representative word from a topic as its label. Hulpus et al. (2013) make use of structured data from DBpedia to label topics.
Lau et al. (2011) proposed a method for automatically labelling topics using information from Wikipedia. A set of candidate labels is generated from Wikipedia article titles by querying using topic terms. Additional labels are then generated by chunk parsing the article titles to identify n-grams that represent Wikipedia articles as well. Outlier labels (less relevant to the topic) are identified and removed. Finally, the top-5 topic terms are added to the candidate set. The labels are ranked using Support Vector Regression (SVR) [21] and features extracted using word association measures (i.e. PMI, t-test, and Dice coefficient), lexical features and search engine ranking. Lau et al. (2011) report two versions of their approach, one unsupervised (which is used as a baseline) and another which is supervised. They reported that the supervised version achieves better performance than a previously reported approach [17].
This paper introduces an alternative graph-based approach which is unsupervised and less computationally intensive than Lau et al. (2011). Our method uses topic keywords to form a query. A graph is generated from the words contained in the search results and these are then ranked using the PageRank algorithm [19, 18]. Evaluation on a standard data set shows that our method consistently outperforms the best performing previously reported method, which is supervised [14].
We use the topic keywords to query a search engine. We assume that the search results returned are relevant to the topic and can be used to identify and weigh relevant keywords. The most important keywords can be used to generate keyphrases for labelling the topic or weight pre-existing candidate labels.
We use the approach described by Lau et al. (2011) to generate candidate labels from Wikipedia articles. The 10 terms with the highest marginal probabilities in the topic are used to query Wikipedia and the titles of the articles retrieved used as candidate labels. Further candidate labels are generated by processing the titles of these articles to identify noun chunks and n-grams within the noun chunks that are themselves the titles of Wikipedia articles. Outlier labels, identified using a similarity measure [9], are removed. This method has been proved to produce labels which effectively summarise a topic’s main subject.
However, it should be noted that our method is flexible and could be applied to any set of candidate labels. We have experimented with various approaches to candidate label generation but chose to report results using the approach described by Lau et al. (2011) to allow direct comparison of approaches.
Information obtained from web searches is used to identify the best labels from the set of candidates. The top n keywords, i.e. those with highest marginal probability within the topic, are used to form a query which was submitted to the Bing11http://www.bing.com/ search engine. Textual information included in the Title field22We also experimented with using the Description field but found that this reduced performance. of the search results metadata was extracted. Each title was tokenised using openNLP33http://opennlp.apache.org/ and stop words removed.
Figure 1 shows a sample of the metadata associated with a search result for the topic: vmware, server, virtual, oracle, update, virtualization, application, infrastructure, management, microsoft.
We consider any remaining words in the search result metadata as nodes, , in a graph . Each node is connected to its neighbouring words in a context window of words. In the previous example, the words added to the graph from the Title of the search result are microsoft, server, cloud, datacenter and virtualization.
We consider both unweighted and weighted graphs. When the graph is unweighted we assume that all the edges have a weight . In addition, we weight the edges of the graph by computing the relatedness between two nodes, and , as their normalised Pointwise Mutual Information (NPMI) [3]. Word co-occurrences are computed using Wikipedia as a a reference corpus. Pairs of words are connected with edges only if avoiding connections between words co-occurring by chance and hence introducing noise.
Important terms are identified by applying the PageRank algorithm [19] in a similar way to the approach used by Mihalcea and Tarau (2004) for document keyphrase extraction. The PageRank score () over for a word () can be computed by the following equation:
(1) |
where denotes the set of vertices which are connected to the vertex . is the damping factor which is set to the default value of [19]. In standard PageRank all elements of the vector are the same, where is the number of nodes in the graph.
Given a candidate label containing keywords, we compute the score of by simply adding the PageRank scores of its constituent keywords:
(2) |
The label with the highest score amongst the set of candidates is selected to represent the topic. We also experimented with normalised versions of the score, e.g. mean of the PageRank scores. However, this has a negative effect on performance since it favoured short labels of one or two words which were not sufficiently descriptive of the topics. In addition, we expect that candidate labels containing words that do not appear in the graph (with the exception of stop words) are unlikely to be good labels for the topic. In these cases the score of the candidate label is set to 0. We also experimented with removing this restriction but found that it lowered performance.
Domain | Model | Top-1 Av. Rating | nDCG-1 | nDCG-3 | nDCG-5 |
---|---|---|---|---|---|
BLOGS | Lau et al. (2011)-U | 1.84 | 0.75 | 0.77 | 0.79 |
Lau et al. (2011)-S | 1.98 | 0.81 | 0.82 | 0.83 | |
PR | 2.05† | 0.83 | 0.84 | 0.83 | |
PR-NPMI | 2.08† | 0.84 | 0.84 | 0.83 | |
Upper bound | 2.45 | 1.00 | 1.00 | 1.00 | |
BOOKS | Lau et al. (2011)-U | 1.75 | 0.77 | 0.77 | 0.79 |
Lau et al. (2011)-S | 1.91 | 0.84 | 0.81 | 0.83 | |
PR | 1.98† | 0.86 | 0.88 | 0.87 | |
PR-NPMI | 2.01† | 0.87 | 0.88 | 0.87 | |
Upper bound | 2.29 | 1.00 | 1.00 | 1.00 | |
NEWS | Lau et al. (2011)-U | 1.96 | 0.80 | 0.79 | 0.78 |
Lau et al. (2011)-S | 2.02 | 0.82 | 0.82 | 0.84 | |
PR | 2.04† | 0.83 | 0.81 | 0.81 | |
PR-NPMI | 2.05† | 0.83 | 0.81 | 0.81 | |
Upper bound | 2.45 | 1.00 | 1.00 | 1.00 | |
PUBMED | Lau et al. (2011)-U | 1.73 | 0.75 | 0.77 | 0.79 |
Lau et al. (2011)-S | 1.79 | 0.77 | 0.82 | 0.84 | |
PR | 1.88†‡ | 0.80 | 0.80 | 0.80 | |
PR-NPMI | 1.90†‡ | 0.81 | 0.80 | 0.80 | |
Upper bound | 2.31 | 1.00 | 1.00 | 1.00 | |
We evaluate our method on the publicly available data set published by Lau et al. (2011). The data set consists of 228 topics generated using text documents from four domains, i.e. blog posts (BLOGS), books (BOOKS), news articles (NEWS) and scientific articles from the biomedical domain (PUBMED). Each topic is represented by its ten most probable keywords. It is also associated with candidate labels and human ratings denoting the appropriateness of a label given the topic. The full data set consists of approximately 6,000 candidate labels (27 labels per topic).
Our evaluation follows the framework proposed by Lau et al. (2011) using two metrics, i.e. Top-1 average rating and nDCG, to compare various labelling methods.
Top-1 average rating is the average human rating (between 0 and 3) assigned to the top-ranked label proposed by the system. This provides an indication of the overall quality of the label the system judges as the best one.
Normalised discounted cumulative gain (nDCG) [13, 6] compares the label ranking proposed by the system to the ranking provided by human annotators. The discounted cumulative gain at position , , is computed using the following equation:
(3) |
where is the relevance of the label to the topic in position . Then nDCG is computed as:
(4) |
where is the superviseed ranking of the image labels, in our experiments this is the ranking provided by the scores in the human annotated data set.
Our proposed model requires two parameters to be set: the context window size when connecting neighbouring words in the graph and the number of the search results considered when constructing the graph.
We experimented with different sizes of context window, , between 1 words to the left and right and all words in the title. The best results were obtained when for all of the domains. In addition, we experimented with varying the number of search results between 10 and 300. We observed no noticeable difference in the performance when the number of search results is equal or greater than 30 (see below). We choose to report results obtained using 30 search results for each topic. Including more results did not improve performance but required additional processing.
Results are shown in Table 1. Performance when PageRank is applied to the unweighted (PR) and NPMI-weighted graphs (PR-NPMI) (see Section 2.2) is shown. Performance of the best unsupervised (Lau et al. (2011)-U) and supervised (Lau et al. (2011)-S) methods reported by Lau et al. (2011) are shown. Lau et al. (2011)-U uses the average scores between the topic keywords and the label keywords while Lau et al. (2011)-S uses SVR to combine evidence from all features. In addition, upper bound figures, the maximum possible value given the scores assigned by the annotators, are also shown.
The results obtained by applying PageRank over the unweighted graph (2.05, 1.98, 2.04 and 1.88) are consistently better than the supervised and unsupervised methods reported by Lau et al. (2011) for the Top-1 Average scores and this improvement is observed in all domains. The difference is significant (t-test, ) for the unsupervised method. A slight improvement in performance is observed when the weighted graph is used (2.08, 2.01, 2.05 and 1.90). This is expected since the weighted graph contains additional information about word relatedness. For example, the word hardware is more related and, therefore, closer in the graph to the word virtualization than to the word investments.
Results from the nDCG metric imply that our methods provide better rankings of the candidate labels in the majority of the cases. It is outperformed by the best supervised approach in two domains, NEWS and PUBMED, using the nDCG-3 and nDCG-5 metrics. However, the best label proposed by our methods is judged to be better (as shown by the nDCG-1 and Top-1 Av. Rating scores), demonstrating that it is only the lower ranked labels in our approach that are not as good as the supervised approach.
An interesting finding is that, although limited in length, the textual information in the search result’s metadata contain enough salient terms relevant to the topic to provide reliable estimates of term importance. Consequently, it is not necessary to measure semantic similarity between topic keywords and candidate labels as previous approaches have done. In addition, performance improvement gained from using the weighted graph is modest, suggesting that the computation of association scores over a large reference corpus could be omitted if resources are limited.
In Figure 6, we show the scores of Top-1 average rating obtained in the different domains by experimenting with the number of search results used to generate the text graph. The most interesting finding is that performance is stable when 30 or more search results are considered. In addition, we observe that quality of the topic labels in the four domains remains stable, and higher than the supervised method, when the number of search results used is between 150 and 200. The only domain in which performance of the supervised method is sometimes better than the approach proposed here is NEWS. The main reason is that news topics are more fine grained and the candidate labels of better quality [14] which has direct impact in good performance of ranking methods.
We described an unsupervised graph-based method to associate textual labels with automatically generated topics. Our approach uses results retrieved from a search engine using the topic keywords as a query. A graph is generated from the words contained in the search results metadata and candidate labels ranked using the PageRank algorithm. Evaluation on a standard data set shows that our method consistently outperforms the supervised state-of-the-art method for the task.