In this paper, we propose a novel model for enriching the content of microblogs by exploiting external knowledge, thus alleviating the data sparseness problem in short text classification. We assume that microblogs share the same topics with the external knowledge. We first build an optimization model to infer the topics of microblogs by employing the topic-word distribution of the external knowledge. The content of microblogs is then further enriched with relevant words from the external knowledge. Experiments on microblog classification show that our approach is effective and outperforms traditional text classification methods.
During the past decade, short text representation has been intensively studied. Previous studies [Phan et al.2008, Guo and Diab2012] show that while traditional methods are weakened by the data sparseness problem, semantic-analysis-based approaches have been proposed and proven effective, with topic models among the most frequently used techniques in this area. Meanwhile, external knowledge has been found helpful [Hu et al.2009] in tackling the data scarcity problem by enriching short texts with informative context; well-organized knowledge bases such as Wikipedia and WordNet are common tools in such methods.
Nowadays, most of the work on short text focuses on microblogs. As a new form of short text, microblogs have some unique features such as informal spelling and emerging words, and many microblogs are also strongly related to up-to-date topics. Every day, far more microblogs are pushed to users than they can read, so finding content of interest becomes rather difficult, and the ability to choose what kind of microblogs to read is urgently demanded by common users. Such an ability can be implemented through effective short text classification.
Treating microblogs as standard texts and classifying them directly cannot achieve effective classification because of the data sparseness problem. On the other hand, news on the Internet is rich in information, and many microblogs are news-related: they share up-to-date topics and sometimes quote each other. Thus, external knowledge such as news provides rich supplementary information for analysing and mining microblogs.
Motivated by the ideas of using topic models and external knowledge mentioned above, we present an LDA-based enriching method using a news corpus and apply it to the task of microblog classification. The basic assumption in our model is that news articles and microblogs tend to share the same topics. We first infer the topic distribution of each microblog based on the topic-word distribution of the news corpus obtained by LDA estimation. With these two distributions, we then add a number of words from the news as additional information to microblogs by evaluating the relatedness between each word and each microblog, since words not appearing in a microblog may still be highly relevant to it.
To sum up, our contributions are:
We formulate the topic inference problem for short texts as a convex optimization problem.
We enrich the content of microblogs by inferring the association between microblogs and external words from a probabilistic perspective.
We evaluate our method on real datasets, where it outperforms the baseline methods.
Based on the idea of exploiting external knowledge, many methods have been proposed to improve the representation of short texts for classification and clustering. Some directly utilize the structural information of an organized knowledge base or a search engine. Banerjee et al. [2007] use the title and the description of a news article as two separate query strings to select related concepts as additional features. Hu et al. [2009] present a framework that improves short text clustering by mining informative context with the integration of Wikipedia and WordNet.
To better leverage external resources, other methods introduce topic models. Phan et al. [2008] present a framework that includes an approach for short text topic inference and adds abstract words as extra features. Guo and Diab [2012] modify classic topic models and propose a matrix-factorization based model for sentence similarity calculation tasks.
Methods without topic models usually rely heavily on the performance of the search system or the completeness of the knowledge base, and lack in-depth analysis of the external resources. Compared with our method, the topic model based methods mentioned above stop at finding latent-space representations of short texts and ignore the fact that relevant words from external knowledge are informative as well.
We formulate the problem as follows. Let $D$ denote the external knowledge consisting of $|D|$ documents, and let $V_D$ represent its vocabulary. Let $S$ denote the microblog set, whose vocabulary is $V_S$. Our task is to enrich each microblog $m \in S$ with additional information so as to improve its representation.
The model we propose consists mainly of three steps:
(a) Topic inference for the external knowledge by running LDA estimation.
(b) Topic inference for microblogs by employing the word distributions of topics obtained in step (a).
(c) Selection of relevant words from the external knowledge to enrich the content of microblogs.
In this section, we perform topic analysis on $D$ using LDA estimation [Blei et al.2003]. We choose LDA as the topic analysis model because of its widely proven effectiveness and ease of understanding.
In LDA, each document $d$ has a distribution $\theta_d$ over all topics, and each topic $z$ has a distribution $\phi_z$ over all words, where $z$, $d$ and $w$ represent a topic, a document and a word respectively. The optimization problem is formulated as maximizing the log likelihood on the corpus:
$$\max_{\{\theta_d\},\{\phi_z\}}\ \sum_{d \in D} \sum_{w \in V_D} n(d,w) \log \sum_{z=1}^{K} \theta_{d,z}\,\phi_{z,w} \qquad (1)$$
In this formulation, $n(d,w)$ represents the term frequency of word $w$ in document $d$. $\theta$ and $\phi$ are the parameters to be inferred, corresponding to the topic distribution of each document and the word distribution of each topic respectively. Estimating the parameters of LDA by directly and exactly maximizing the likelihood of the corpus in (1) is intractable, so we use Gibbs Sampling for estimation.
After performing LDA estimation with $K$ topics on $D$, we obtain the topic distribution of each document $d$ ($d \in D$), denoted as $\theta_d$, and the word distribution of each topic $z$ ($z = 1, \dots, K$), denoted as $\phi_z$. Step (b) relies heavily on the word distributions $\phi_z$ obtained here.
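For concreteness, the sketch below illustrates step (a) with a minimal collapsed Gibbs sampler for LDA in Python. It is an illustration of the technique rather than the GibbsLDA++ implementation we actually use; the hyperparameters `alpha`, `beta` and the iteration count are illustrative defaults.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.5, beta=0.1):
    """Collapsed Gibbs sampling for LDA (minimal sketch).

    docs: list of documents, each a list of word ids.
    Returns point estimates (theta, phi) of the doc-topic and
    topic-word distributions."""
    rng = np.random.default_rng(0)
    n_dz = np.zeros((len(docs), n_topics))   # doc-topic counts
    n_zw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_z = np.zeros(n_topics)                 # total words per topic
    z_assign = []
    for d, doc in enumerate(docs):           # random initialization
        zs = rng.integers(n_topics, size=len(doc))
        z_assign.append(zs)
        for w, z in zip(doc, zs):
            n_dz[d, z] += 1; n_zw[z, w] += 1; n_z[z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_assign[d][i]           # remove current assignment
                n_dz[d, z] -= 1; n_zw[z, w] -= 1; n_z[z] -= 1
                # full conditional P(z | rest), up to a constant
                p = (n_dz[d] + alpha) * (n_zw[:, w] + beta) / (n_z + vocab_size * beta)
                z = rng.choice(n_topics, p=p / p.sum())
                z_assign[d][i] = z
                n_dz[d, z] += 1; n_zw[z, w] += 1; n_z[z] += 1
    theta = (n_dz + alpha) / (n_dz + alpha).sum(axis=1, keepdims=True)
    phi = (n_zw + beta) / (n_zw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```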
In this section, we infer the topic distribution of each microblog. Because of the assumption that microblogs share the same topics with the external corpus, the "topic distribution" here refers to a distribution over the $K$ topics discovered on $D$.
Differing from step (a), the topic inference for microblogs does not run LDA estimation directly on the microblog collection, but follows the topics from the external knowledge to ensure topic consistency. We employ the word distributions $\phi_z$ obtained in step (a) and formulate an optimization problem similar in form to Formula (1):
$$\max_{\{\theta'_m\}}\ \sum_{m \in S} \sum_{w \in V_D} n(m,w) \log \sum_{z=1}^{K} \theta'_{m,z}\,\phi_{z,w} \qquad (2)$$
where $n(m,w)$ represents the term frequency of word $w$ in microblog $m$, and $\theta'_m$ denotes the distribution of microblog $m$ over the $K$ topics on $D$. Obviously most $n(m,w)$ are zero, and we ignore words that do not appear in $V_D$.
Compared with the original LDA optimization problem (1), the topic inference problem for microblogs (2) follows the same document generation process but replaces the topics to be estimated with the known topics from the external corpus. As a result, the only parameters to be inferred are the topic distributions $\theta'_m$ of the microblogs.
It is noteworthy that, since the word distribution $\phi_z$ of every topic is known, Formula (2) can be further solved by separating it into $|S|$ independent subproblems, one per microblog:
$$\max_{\theta'_m}\ \sum_{w \in V_D} n(m,w) \log \sum_{z=1}^{K} \theta'_{m,z}\,\phi_{z,w} \quad \text{s.t.}\ \ \theta'_{m,z} \ge 0,\ \ \sum_{z=1}^{K} \theta'_{m,z} = 1 \qquad (3)$$
These subproblems correspond one-to-one to the microblogs and are easily proved convex. After solving them, we obtain the topic distribution of each microblog $m$ ($m \in S$), denoted as $\theta'_m$.
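As an illustration, subproblem (3) for a single microblog can be handed to an off-the-shelf solver by minimizing the negative log likelihood over the probability simplex. The sketch below uses scipy's general-purpose SLSQP method as a stand-in for the convex solver used in our experiments (the OPTI toolbox, see Section 4); `counts` and `phi` are assumed to come from the preceding steps.

```python
import numpy as np
from scipy.optimize import minimize

def infer_microblog_topics(counts, phi):
    """Solve subproblem (3) for one microblog (sketch).

    counts: length-|V_D| term-frequency vector of the microblog.
    phi: K x |V_D| topic-word matrix from step (a).
    Returns theta'_m, the microblog's topic distribution."""
    K = phi.shape[0]
    words = np.nonzero(counts)[0]              # only words with n(m,w) > 0 contribute
    def neg_log_lik(theta):
        p = theta @ phi[:, words]              # P(w|m) under the current theta
        return -np.sum(counts[words] * np.log(p + 1e-12))
    cons = ({"type": "eq", "fun": lambda t: t.sum() - 1.0},)  # simplex: sums to 1
    bounds = [(0.0, 1.0)] * K                                 # simplex: nonnegative
    res = minimize(neg_log_lik, np.full(K, 1.0 / K),
                   method="SLSQP", bounds=bounds, constraints=cons)
    return res.x
```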
In this section, we select relevant words from the external knowledge to enrich the content of every microblog.
Based on the results of steps (a) and (b), we calculate the word distribution of each microblog as follows:
$$P(w \mid m) = \sum_{z=1}^{K} \theta'_{m,z}\,\phi_{z,w} \qquad (4)$$
where $P(w \mid m)$ represents the probability that word $w$ will appear in microblog $m$. In other words, even though a word may not actually appear in a microblog, there is still a probability that it is highly relevant to that microblog. Intuitively, this probability indicates the strength of association between a word and a microblog. The word distribution of every microblog is based on topic analysis, and its accuracy relies heavily on the accuracy of the topic inference in step (b). In fact, the more words a microblog contains, the more accurate its topic inference will be, which can be regarded as an explanation of why data sparseness degrades performance.
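Computationally, Equation (4) is a single matrix-vector product. The following sketch uses toy random distributions in place of the estimated $\phi_z$ and $\theta'_m$, simply to show the shapes involved:

```python
import numpy as np

# Toy stand-ins: K = 100 topics over a 5000-word external vocabulary,
# plus the inferred topic distribution theta'_m of one microblog.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(5000), size=100)  # phi[z, w] = P(w | z), from step (a)
theta_m = rng.dirichlet(np.ones(100))         # theta'_m, from step (b)

p_w_m = theta_m @ phi                         # Equation (4): P(w | m) for every word w
top_ids = np.argsort(p_w_m)[::-1][:300]       # ids of the L = 300 most relevant words
```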
For microblog $m$, we sort all words in $V_D$ by $P(w \mid m)$ in descending order. Taking the top $L$ relevant words from this ranking, we redefine the "term frequency" of each of these words when adding them to microblog $m$ as additional content. Supposing these words are $w_1, \dots, w_L$, the revised term frequency of word $w_i$ is defined as follows:
$$tf'(w_i, m) = L \cdot \frac{P(w_i \mid m)}{\sum_{j=1}^{L} P(w_j \mid m)} \qquad (5)$$
where $tf'(w_i, m)$ is the revised term frequency of word $w_i$ in microblog $m$.
As Equation (5) shows, the revised term frequency of each added word is proportional to its probability $P(w_i \mid m)$ rather than being a constant.
So far, we can add these $L$ words and their revised term frequencies as additional information to microblog $m$. The revised term frequency plays the same role as TF in a common text representation vector, so we calculate the TFIDF of the added words as:
$$tfidf(w_i, m) = tf'(w_i, m) \times \log \frac{|S|}{df(w_i)} \qquad (6)$$
Note that the document frequency $df(w_i)$ changes as new words are added to each microblog. The TFIDF vector of a microblog with the additional words is called the enhanced vector.
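The sketch below puts Equations (5) and (6) together for one microblog, under the normalization assumed above (revised frequencies proportional to $P(w_i \mid m)$ and summing to $L$); `df` is assumed to already reflect the added words.

```python
import numpy as np

def enrich_tfidf(p_w_m, top_ids, df, n_microblogs):
    """Revised tf (Eq. 5) and tfidf (Eq. 6) for the L added words (sketch).

    p_w_m: P(w|m) over the external vocabulary, from Eq. (4).
    top_ids: ids of the L most relevant words for this microblog.
    df: document frequencies over the microblog set, assumed updated
    with the added words. n_microblogs: |S|."""
    p_top = p_w_m[top_ids]
    tf_rev = len(top_ids) * p_top / p_top.sum()   # Eq. (5): proportional to P(w|m)
    idf = np.log(n_microblogs / df[top_ids])      # Eq. (6): idf over the microblog set
    return tf_rev * idf                           # tfidf entries of the enhanced vector
```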
To evaluate our method, we build our own datasets. We crawl 95028 Chinese news reports from the Sina News website (http://news.sina.com.cn/), segment them, and remove stop words and rare words. After this preprocessing, the news documents are used as external knowledge. As for microblogs, we crawl a number of microblogs from Sina Weibo (http://www.weibo.com/) and ask unbiased assessors to manually classify them into 9 categories following the column setting of Sina News. After the manual classification, we remove microblogs shorter than a minimum word count, as well as usernames, links and some special characters; we then segment the remaining microblogs and likewise remove rare words (a sketch of this preprocessing pipeline follows Table 1). Finally, we get 1671 classified microblogs as our microblog dataset. The size of each category is shown in Table 1.
Category | #Microblog |
---|---
Finance | 229 |
Stock | 80 |
Entertainment | 162 |
Military Affairs | 179 |
Technologies | 204 |
Digital Products | 194 |
Sports | 195 |
Society | 214 |
Daily Life | 214 |
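The preprocessing mentioned above follows a standard segment-and-filter pipeline. A sketch is given below; the paper does not name its segmentation tool or exact frequency threshold, so jieba and `min_count=5` here are assumptions.

```python
from collections import Counter
import jieba  # a common Chinese segmenter -- the paper does not name its tool

def preprocess(texts, stopwords, min_count=5):
    """Segment texts, drop stop words, then drop corpus-rare words.

    min_count is an illustrative threshold; the paper's value is not given."""
    token_lists = [[t for t in jieba.lcut(x) if t.strip() and t not in stopwords]
                   for x in texts]
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in token_lists]
```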
There are some important implementation details. In step (a) of Section 3.1, we estimate the LDA model using GibbsLDA++ (http://gibbslda.sourceforge.net), a C/C++ implementation of LDA using Gibbs Sampling. In step (b) of Section 3.2, the OPTI toolbox (http://www.i2c2.aut.ac.nz/Wiki/OPTI/) for Matlab is used to solve the convex problems. In the classification tasks reported below, we use SVM.NET (http://www.matthewajohnson.org/software/svm.html), an implementation of LibSVM, as the classifier, and perform ten-fold cross validation to evaluate classification accuracy.
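The evaluation protocol amounts to ten-fold cross validation over the enhanced vectors. The sketch below reproduces it with scikit-learn's LinearSVC as a stand-in for the SVM.NET classifier, on toy random data in place of the real features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-ins: 1671 microblogs, a 2000-word feature space, 9 categories.
rng = np.random.default_rng(0)
X = rng.random((1671, 2000))      # in practice: the enhanced TFIDF vectors
y = rng.integers(9, size=1671)    # in practice: the manually assigned categories

scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="accuracy")
print(f"ten-fold average accuracy: {scores.mean():.4f}")
```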
Representation | Average Accuracy |
---|---
TFIDF vector | 0.7552 |
Boolean vector | 0.7203 |
Enhanced vector | 0.8453 |
In this section, we report the average accuracy of each method, as shown in Table 2. The enhanced vector is the representation generated by our method; the two baselines are the TFIDF vector [Jones1972] and the boolean vector (word occurrence) of the original microblog. As the table shows, our method increases the classification accuracy from 75.52% to 84.53% by considering the additional information, which means it indeed improves the representation of microblogs.
Microblog (Translated) | Top Relevant Words (Translated) |
---|---
Kim Jong Un held an emergency meeting this morning, and commanded the missile units to prepare for attacking U.S. military bases at any time. | South Korea, America, North Korea, work, safety, claim, military, exercise, united, report |
Shenzhou Nine will carry three astronauts, including the first Chinese female astronaut, and launch in a proper time during the middle of June. | day, satellite, launch, research, technology, system, mission, aerospace, success, Chang’e Two |
The experiment corresponding to Figure 1 examines how the classification accuracy changes when we fix the number of topics $K$ and vary the number of added words $L$. The result shows that more added words do not necessarily mean higher accuracy. By studying some cases, we find that if we add too many words, the proportion of "noisy words" increases. We reach the best result when the number of added words is 300.
The experiment corresponding to Figure 2 examines how the classification accuracy changes when we fix the number of added words $L$ and vary the number of topics $K$. As we can see, the accuracy does not grow monotonically as the number of topics increases, so blindly enlarging the topic number will not improve accuracy. The best result is reached when the topic number is 100, and similar experiments adding different numbers of words show the same behavior.
The experiment corresponding to Figure 3 examines whether, and how, redefining "term frequency" as the revised term frequency in step (c) of Section 3.3 affects the classification accuracy. The results should be analysed from two aspects. On one hand, without the redefinition, the accuracy remains at a stable, high level but tends to decrease as we add more words. One reason for the decrease is that "noisy words" have an increasing negative impact on accuracy, as their proportion grows with the number of added words. On the other hand, the best overall result is reached when we use the revised term frequency. This suggests that our redefinition of term frequency improves microblog representation under certain conditions, but is not optimal in all situations.
In Table 3, we present several cases, each consisting of a microblog and its top relevant words.
In the first case, we successfully find the country name from its leader's name and the limited information in the sentence; other related countries and events are also selected by our model, as they often appear together in news. In the second case, the relevant words are among the most frequently used words in news and have close semantic relations with the microblog in certain aspects.
As we can see, our topic-analysis-based model shows a strong ability to mine relevant words. Other cases show that the model could be further improved by removing noisy and meaningless words from among those added.
We propose an effective content enriching method for microblogs to enhance classification accuracy, exploiting a news corpus as external knowledge. Technically, our method uses LDA as its topic analysis model and formulates topic inference for new data as convex optimization problems. Compared with traditional representations, the enriched microblogs show great improvement in classification tasks.
Since we do not control the quality of the added words, our future work starts from building a filter to select better additional information. Moreover, to make the most of external knowledge, better ways to build the topic space should be considered.
This work is supported by National Natural Science Foundation of China (Grant No. 61170091).