Automatic extraction of new words is an indispensable precursor to many NLP tasks such as Chinese word segmentation, named entity extraction, and sentiment analysis. This paper aims to extract new sentiment words from large-scale user-generated content. We propose a fully unsupervised, purely data-driven framework for this purpose. We design statistical measures to quantify the utility of a lexical pattern and to measure the possibility of a word being a new word, respectively. The method is almost free of linguistic resources (except POS tags) and requires no elaborate linguistic rules. We also demonstrate how new sentiment words benefit sentiment analysis. Experimental results demonstrate the effectiveness of the proposed method.
New words are constantly emerging on the Internet, particularly in user-generated content. Users like to update and share information on social websites in their own language styles, in which new political, social, and cultural words are constantly used.
However, such new words make many natural language processing tasks more challenging. Automatic extraction of new words is indispensable to many tasks such as Chinese word segmentation, machine translation, named entity extraction, question answering, and sentiment analysis. New word detection is one of the most critical issues in Chinese word segmentation. Recent studies [17] [3] have shown that more than 60% of word segmentation errors result from new words. Statistics show that more than 1,000 new Chinese words appear every year [19]. These words are mostly domain-specific technical terms and time-sensitive political/social/cultural terms. Most of them are not yet correctly recognized by segmentation algorithms and remain out-of-vocabulary (OOV) words.
New word detection is also important for sentiment analysis tasks such as opinionated phrase extraction and polarity classification. A sentiment phrase with complete meaning should have a correct boundary; however, the characters of a new word may be broken apart. For example, in the sentence "表演/n 非常/adv 给/v 力/n (the artists' performance is very impressive)", the two Chinese characters "给/v 力/n (cool; powerful)" should always be extracted together. In polarity classification, new words can be informative features for classification models. In the previous example, "给力 (cool; powerful)" is a strong feature for classification models while each single character is not. Adding new words as features in classification models improves the performance of polarity classification, as demonstrated later in this paper.
This paper aims to detect new words for sentiment analysis. We are particularly interested in extracting new sentiment words that can express opinions or sentiment, which are of high value to sentiment analysis. New sentiment words, as exemplified in Table 1, are a sub-class of multi-word expressions: sequences of neighboring words "whose exact and unambiguous meaning or connotation cannot be derived from the meaning or connotation of its components" [5]. Such new words cannot be directly identified using grammatical rules, which poses a major challenge to automatic analysis. Moreover, existing lexical resources never have adequate and timely coverage since new words appear constantly. People thus resort to statistical methods such as Pointwise Mutual Information [6], Symmetrical Conditional Probability [7], Mutual Expectation [8], Enhanced Mutual Information [22], and Multi-word Expression Distance [2].
Table 1: Examples of new sentiment words.

| New word | English translation | Polarity |
|---|---|---|
| 口爱 | lovely | positive |
| 杯具 | tragic/tragedy | negative |
| 给力 | very cool; powerful | positive |
| 坑爹 | reverse one's expectation | negative |
Our central idea for new sentiment word detection is as follows: Starting from very few seed words (for example, just one seed word), we can extract lexical patterns that have strong statistical association with the seed words; the extracted lexical patterns can be further used in finding more new words, and the most probable new words can be added into the seed word set for the next iteration; and the process can be run iteratively until a stop condition is met. The key issues are to measure the utility of a pattern and to quantify the possibility of a word being a new word. The main contributions of this paper are summarized as follows:
We propose a novel framework for new word detection from large-scale user-generated data. This framework is fully unsupervised and purely data-driven, and requires very lightweight linguistic resources (i.e., only POS tags).
We design statistical measures to quantify the utility of a pattern and the possibility of a word being a new word, respectively. No elaborate linguistic rules are needed to filter undesirable results, which may make our approach portable to other languages.
We investigate the problem of polarity prediction for new sentiment words and demonstrate that including new sentiment words benefits sentiment classification tasks.
The rest of the paper is structured as follows: we will introduce related work in the next section. We will describe the proposed method in Section 3, including definitions, the overview of the algorithm, and the statistical measures for addressing the two key issues. We then present the experiments in Section 4. Finally, the work is summarized in Section 5.
New word detection has usually been interwoven with word segmentation, particularly in Chinese NLP. In these works, new word detection is considered an integral part of segmentation: new words are identified as the most probable segments inferred by probabilistic models, and the detected new words can be further used to improve word segmentation. Typical models include the conditional random fields proposed by [14] and a joint model trained with adaptive online gradient descent based on feature frequency information [18].
Another line of work treats new word detection as a separate task, usually preceded by part-of-speech tagging. The first genre of such studies leverages complex linguistic rules or knowledge. For example, Justeson and Katz [11] extracted technical terminology from documents using a regular expression. Argamon et al. [1] segmented the POS sequence of a multi-word into small POS tiles, counted tile frequencies in new words and non-new-words on the training set, and detected new words using these counts. Chen and Ma [4] employed morphological and statistical rules to extract Chinese new words. The second genre treats new word detection as a classification problem. Zhou [25] proposed a discriminative Markov Model to detect new words by chunking one or more separated words. In [12], new word detection was viewed as a binary classification problem. However, these supervised models require not only heavy engineering of linguistic features but also expensive annotation of training data.
User behavior data has recently been explored for finding new words. Zheng et al. [24] explored user typing behaviors in the Sogou Chinese Pinyin input method to detect new words. Zhang et al. [23] proposed to use dynamic time warping to detect new words from query logs. However, both works are limited by the public unavailability of such expensive commercial resources.
Statistical methods for new word detection have been extensively studied and, in some sense, exhibit advantages over linguistics-based methods. In this setting, new word detection is mostly known as multi-word expression extraction. To measure multi-word association, the first model is Pointwise Mutual Information (PMI) [6]. Since then, a variety of statistical methods have been proposed to measure bigram association, such as Log-likelihood [9] and Symmetrical Conditional Probability (SCP) [7]. Among 84 bigram association measures, PMI has been reported to be the best one on Czech data [13]. To measure arbitrary n-grams, the most common strategy is to separate an n-gram into two parts X and Y so that existing bigram methods can be used [7, 8, 16]. Zhang et al. [22] proposed Enhanced Mutual Information (EMI), which measures the cohesion of an n-gram by the frequency of the n-gram itself and the frequency of each single word. Based on information distance theory, Bu et al. [2] proposed the multi-word expression distance (MED) and its normalized version, and reported superior performance to EMI, SCP, and other measures.
Adverbial word: a word that is used mainly to modify a verb or an adjective, such as "太 (too)", "非常 (very)", "十分 (very)", and "特别 (specially)".
Auxiliary word: an auxiliary, a modal particle, or a punctuation mark. In Chinese, such words include "着, 了, 啦, 的, 啊", and punctuation marks include "，。！？；：" and so on.
A lexical pattern is a triplet ⟨adverbial word, *, auxiliary word⟩, where the wildcard * stands for an arbitrary number of words (we limit it to at most 3 words in this work, considering computation costs).
Table 2 gives some examples of lexical patterns. To obtain lexical patterns, we can define regular expressions over POS tags (such expressions are very simple and easy to write because we only need to consider the POS tags of adverbial and auxiliary words) and apply them to POS-tagged texts. Since the tags of adverbial and auxiliary words are relatively stable and can be easily identified, this method can safely obtain lexical patterns.
Table 2: Examples of lexical patterns and their frequencies.

| Pattern | Frequency |
|---|---|
| ⟨"都", *, "了"⟩ | 562,057 |
| ⟨"都", *, "的"⟩ | 387,649 |
| ⟨"太", *, "了"⟩ | 380,470 |
| ⟨"不", *, "，"⟩ | 369,702 |
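To make the extraction step concrete, the following is a minimal Python sketch of matching ⟨adverbial, *, auxiliary⟩ patterns over POS-tagged tokens. The tag names (`d` for adverbs, `u` for auxiliaries, `w` for punctuation) and all function names are illustrative assumptions in the style of ICTCLAS-like tagsets, not the exact implementation used in the paper.

```python
from collections import Counter

MAX_GAP = 3  # the wildcard * is limited to at most 3 words

def extract_patterns(tagged_post, max_gap=MAX_GAP):
    """tagged_post: list of (word, pos) tuples for one post."""
    patterns = []
    for i, (adv, pos_i) in enumerate(tagged_post):
        if pos_i != "d":               # adverbial word on the left
            continue
        # scan up to max_gap words to the right for an auxiliary word
        for j in range(i + 1, min(i + 1 + max_gap + 1, len(tagged_post))):
            word_j, pos_j = tagged_post[j]
            if pos_j in ("u", "w"):    # auxiliary word or punctuation on the right
                middle = tuple(w for w, _ in tagged_post[i + 1:j])
                if middle:             # the wildcard must cover at least one word
                    patterns.append(((adv, word_j), middle))
                break
    return patterns

# Toy usage on one post: "表演 非常 给 力 了"
post = [("表演", "n"), ("非常", "d"), ("给", "v"), ("力", "n"), ("了", "u")]
pattern_counts = Counter(pat for pat, _ in extract_patterns(post))
print(pattern_counts)   # Counter({('非常', '了'): 1})
```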
The algorithm works as follows: starting from very few seed words (for example, a single word from Table 1), it finds lexical patterns that have strong statistical association with the seed words, where the likelihood ratio test (LRT) is used to quantify the degree of association. The extracted lexical patterns are then used to find more new words. We design several measures to quantify the possibility of a candidate word being a new word, and the top-ranked words are added to the seed word set for the next iteration. The process runs iteratively until a stop condition is met. Note that we do not augment the pattern set at each iteration; instead, we keep a fixed, small number of patterns during iteration because this strategy produces optimal results.
From a linguistic perspective, new sentiment words are commonly modified by adverbial words and can thus be extracted by lexical patterns; this is why the algorithm works. Our algorithm is similar in spirit to double propagation [15], but the differences are apparent: first, we use only very lightweight linguistic information (POS tags); second, our major contributions are statistical measures that address the two key issues, namely measuring the utility of lexical patterns and measuring the possibility of a candidate word being a new word.
Algorithm 1: New sentiment word detection.

Input:
- a large set of POS-tagged posts;
- a set of seed words;
- the number of patterns chosen at each iteration;
- the number of patterns in the candidate pattern set;
- the number of words added at each iteration;
- the number of words returned (K).
Output: a list of ranked new words.

1. Obtain all lexical patterns by applying the regular expressions to the POS-tagged posts.
2. Count the frequency of each lexical pattern and extract the words matched by each pattern.
3. Take the top frequent patterns as the candidate pattern set C_p and the top 5,000 frequent words as the candidate word set C_w.
4. Initialize the current word set W with the seed words; t = 0.
5. Repeat until the stop condition is met:
6.   Score each pattern in C_p against W and select the top patterns as the current pattern set P.
7.   Use P to extract new words; for the extracted words that are in C_w, score them.
8.   Add the top-scored words to W and remove them from C_w.
9. Sort the words in the candidate word set by their scores and output the ranked list.
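The following Python skeleton illustrates this bootstrapping loop. It assumes two scoring callables, `pattern_utility` and `word_score`, corresponding to the measures defined in the next subsections; all names, and the choice to rank all candidate words at the end, are illustrative rather than the paper's exact implementation.

```python
def detect_new_words(candidate_patterns, candidate_words, seed_words,
                     pattern_utility, word_score,
                     n_patterns, n_words, n_iterations):
    """pattern_utility(p, words) -> float (cf. Formula (2));
    word_score(w, patterns) -> float (cf. Formulas (8)-(12))."""
    words = set(seed_words)             # current word set W
    remaining = set(candidate_words)    # candidate word set C_w
    for _ in range(n_iterations):
        # select the patterns most associated with the current word set
        patterns = sorted(candidate_patterns,
                          key=lambda p: pattern_utility(p, words),
                          reverse=True)[:n_patterns]
        # score the remaining candidate words against the selected patterns
        best = sorted(remaining,
                      key=lambda w: word_score(w, patterns),
                      reverse=True)[:n_words]
        words.update(best)              # grow the word set
        remaining.difference_update(best)
    # final ranking of the candidate words with the converged pattern set
    patterns = sorted(candidate_patterns,
                      key=lambda p: pattern_utility(p, words),
                      reverse=True)[:n_patterns]
    return sorted(candidate_words,
                  key=lambda w: word_score(w, patterns), reverse=True)
```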
The first key issue is to quantify the utility of a pattern at each iteration. This can be measured by the association of the pattern with the current word set used in the algorithm. The likelihood ratio test (LRT) [9] is used for this purpose. This association model has also been used to model the association between opinion target words [10].
The LRT is well known for not relying critically on the assumption of normality, instead, it uses the asymptotic assumption of the generalized likelihood ratio. In practice, the use of likelihood ratios tends to result in significant improvements in text-analysis performance.
In our problem, the LRT builds a contingency table for a pattern p and a word w from corpus statistics, as given in Table 3, where a is the number of documents in which w matches pattern p, b is the number of documents in which w occurs but p does not, c is the number of documents in which p occurs but w does not, and d is the number of documents containing neither w nor p.
Table 3: Contingency table for a pattern p and a word w.

| Statistics | w | ¬w |
|---|---|---|
| p | a | c |
| ¬p | b | d |
Based on the statistics shown in Table 3, the likelihood ratio test (LRT) model captures the statistical association between a pattern p and a word w with the following formula:

LRT(p, w) = \log \frac{L(\rho_1, k_1, n_1)\, L(\rho_2, k_2, n_2)}{L(\rho, k_1, n_1)\, L(\rho, k_2, n_2)}    (1)

where:
k_1 = a; n_1 = a + b; k_2 = c; n_2 = c + d;
\rho_1 = k_1 / n_1; \rho_2 = k_2 / n_2; \rho = (k_1 + k_2) / (n_1 + n_2); and L(\rho, k, n) = \rho^k (1 - \rho)^{n - k}.
Thus, the utility of a pattern p can be measured as follows:

u(p) = \sum_{w \in W} LRT(p, w)    (2)

where W is the current word set used in the algorithm (see Algorithm 1).
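As a concrete illustration, here is a small Python sketch of the statistic in Formula (1), assuming the binomial-likelihood form reconstructed above; the pattern utility of Formula (2) is then just the sum of this value over the current word set. The count-lookup structure is an assumption for illustration.

```python
import math

def log_l(rho, k, n):
    """log L(rho, k, n) = k*log(rho) + (n - k)*log(1 - rho), clipped for safety."""
    eps = 1e-12
    rho = min(max(rho, eps), 1.0 - eps)
    return k * math.log(rho) + (n - k) * math.log(1.0 - rho)

def lrt(a, b, c, d):
    """Association between a pattern p and a word w from the counts of Table 3."""
    k1, n1 = a, a + b            # documents containing w: how many also match p
    k2, n2 = c, c + d            # documents without w: how many match p
    if n1 == 0 or n2 == 0:
        return 0.0
    rho1, rho2 = k1 / n1, k2 / n2
    rho = (k1 + k2) / (n1 + n2)
    return (log_l(rho1, k1, n1) + log_l(rho2, k2, n2)
            - log_l(rho, k1, n1) - log_l(rho, k2, n2))

def pattern_utility(counts_for_pattern, word_set):
    """Formula (2): sum the LRT statistic over the current word set.
    counts_for_pattern[w] -> (a, b, c, d) for this pattern and word w."""
    return sum(lrt(*counts_for_pattern[w])
               for w in word_set if w in counts_for_pattern)
```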
Another key issue in the proposed algorithm is to quantify the possibility of a candidate word being a new word. We consider several factors for this purpose.
Very similar to the pattern utility measure, LRT can also be used to measure the association of a candidate word w with the given pattern set, as follows:

LRT(w) = \sum_{p \in P} LRT(p, w)    (3)

where P is the current pattern set used in the algorithm (see Algorithm 1) and p is a lexical pattern.
This measure only quantifies the association of a candidate word with the given pattern set; it tells nothing about the possibility of the word being a new word. However, a new sentiment word should have close association with the lexical patterns. This has a linguistic interpretation: new sentiment words are commonly modified by adverbial words and should therefore associate closely with lexical patterns. This measure is shown to be an influential factor by our experiments in Section 4.3.
If a candidate word is a new word, it will be used with more diversified lexical patterns, since the non-compositionality of a new word means that it can appear in many different linguistic scenarios. This can be measured by information entropy, as follows:

LPE(w) = -\sum_{l \in LW(w)} \frac{c_l(w)}{c(w)} \log \frac{c_l(w)}{c(w)}    (4)

where LW(w) is the set of left (adverbial) words of all patterns in C_p by which word w can be matched, c_l(w) is the count of w being matched by patterns whose left word is l, and c(w) is the count of w being matched by the patterns in C_p. Note that we use the candidate pattern set C_p instead of the current pattern set P, because the latter is very small while computing entropy needs a large number of patterns. Tuning the size of C_p will be further discussed in Section 4.4.
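A minimal sketch of the entropy in Formula (4), assuming the input lists the adverbial (left) word of every candidate-pattern match of the word in question:

```python
import math
from collections import Counter

def left_pattern_entropy(match_left_words):
    """match_left_words: left (adverbial) words of the candidate patterns that
    matched the word, one entry per match occurrence."""
    counts = Counter(match_left_words)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Example: a word matched 3 times under "太" and once under "都"
print(left_pattern_entropy(["太", "太", "太", "都"]))   # ≈ 0.562
```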
Some words occur very frequently and can be widely matched by lexical patterns, yet they are not new words. For example, "爱吃 (love to eat)" and "爱说 (love to talk)" can be matched by many lexical patterns, but they are not new words because they lack non-compositionality: each single character in such words has a high probability of being a word by itself. Thus, we design the following measure to capture this observation:
NWP(w) = \prod_{i=1}^{n} \big(1 - P(c_i)\big)    (5)

where w = c_1 c_2 \ldots c_n, each c_i is a single character, and P(c_i) is the probability of character c_i being a word, computed as

P(c) = \frac{f_s(c)}{f(c)}

where f(c) is the total frequency of character c and f_s(c) is the frequency of c occurring as a single-character word. Obviously, obtaining f_s(c) requires a particular Chinese word segmentation tool; in this work, we resort to ICTCLAS [21], a widely used tool in the literature.
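The sketch below follows the reconstruction given in Formula (5), i.e., it assumes the character probabilities are combined as a product of (1 − P(c)); the exact combination used in the paper may differ, and the frequency dictionaries are illustrative.

```python
def char_word_prob(char, single_char_freq, total_freq):
    """P(c) = (frequency of c as a single-character word) / (total frequency of c)."""
    total = total_freq.get(char, 0)
    return single_char_freq.get(char, 0) / total if total else 0.0

def new_word_prob(word, single_char_freq, total_freq):
    """Assumed combination for Formula (5): product of (1 - P(c)) over the
    characters of `word`; words whose characters rarely stand alone score higher."""
    score = 1.0
    for ch in word:
        score *= 1.0 - char_word_prob(ch, single_char_freq, total_freq)
    return score

# "爱吃": both characters often appear as standalone words, so NWP is low.
freq_total = {"爱": 1000, "吃": 800}
freq_single = {"爱": 700, "吃": 600}
print(new_word_prob("爱吃", freq_single, freq_total))   # 0.3 * 0.25 = 0.075
```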
New words are usually multi-word expressions, and a variety of statistical measures have been proposed to detect multi-word expressions; such measures can be naturally incorporated into our algorithm.
The first measure is the enhanced mutual information (EMI) [22]:

EMI(w) = \log_2 \frac{F / N}{\prod_{i=1}^{n} \frac{F_i - F}{N}}    (6)

where F is the number of posts in which the multi-word expression w = w_1 w_2 \ldots w_n occurs, F_i is the number of posts in which the single word w_i occurs, and N is the total number of posts. The key idea of EMI is to measure a word pair's dependency as the ratio of its probability of being a multi-word to its probability of not being a multi-word. The larger the value, the more likely the expression is a multi-word expression.
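A small Python helper following the EMI definition as reconstructed in Formula (6); the guard against zero factors and the example counts are illustrative assumptions.

```python
import math

def emi(expr_post_count, word_post_counts, n_posts):
    """Formula (6): F = expr_post_count, F_i = word_post_counts[i], N = n_posts."""
    numerator = expr_post_count / n_posts
    denominator = 1.0
    for f_i in word_post_counts:
        denominator *= max(f_i - expr_post_count, 1) / n_posts  # avoid a zero factor
    return math.log2(numerator / denominator)

# Example: "给力" occurs in 5,000 posts; "给" in 200,000 and "力" in 150,000 of 1M posts.
print(emi(5000, [200000, 150000], 1_000_000))   # larger value = more cohesive
```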
The second measure we take into account is the normalized multi-word expression distance (NMED) [2], which was proposed to measure the non-compositionality of multi-word expressions:

NMED(w) = \frac{\log |D(w_1 \cap \cdots \cap w_n)| - \log |D(w)|}{\log N - \log |D(w)|}    (7)

where D(w_1 \cap \cdots \cap w_n) is the set of documents in which all single words of w co-occur, D(w) is the set of documents in which w occurs as a whole, and N is the total number of documents. Different from EMI, this measure is a strict distance metric: a smaller value indicates a larger possibility of being a multi-word expression. As can be seen from the formula, the key idea of this metric is to compare the co-occurrence of all words of a multi-word expression with the occurrence of the whole expression.
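The corresponding sketch for Formula (7); the clipping of the counts is an assumption added to keep the logarithms well defined.

```python
import math

def nmed(cooccur_doc_count, whole_doc_count, n_docs):
    """Formula (7): smaller distance = stronger non-compositionality.
    cooccur_doc_count: documents where all component words co-occur;
    whole_doc_count: documents containing the expression as a whole."""
    whole = max(whole_doc_count, 1)
    co = max(cooccur_doc_count, whole)   # co-occurrence count is at least `whole`
    return (math.log(co) - math.log(whole)) / (math.log(n_docs) - math.log(whole))

# "给力" as a whole appears in 5,000 documents; "给" and "力" co-occur in 6,000.
print(nmed(6000, 5000, 1_000_000))   # a small distance, i.e., a likely expression
```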
Taking the aforementioned factors into account, we use several settings to score a candidate word, combining the word-to-pattern association (LRT), the left pattern entropy (LPE), the new word probability (NWP), and the multi-word expression measures EMI and NMED (Formulas (8)-(12)); the concrete combinations evaluated are those listed in Table 4 and Figure 1.
In this section, we will conduct the following experiments: first, we will compare our method to several baselines, and perform parameter tuning with extensive experiments; second, we will classify polarity of new sentiment words using two methods; third, we will demonstrate how new sentiment words will benefit sentiment classification.
We crawled 237,108,977 Weibo posts from http://www.weibo.com, the largest social website in China. These posts range from January of 2011 to December of 2012. The posts were then part-of-speech tagged using a Chinese word segmentation tool named ICTCLAS [21].
Then, we asked two annotators to label the top 5,000 frequent words extracted by lexical patterns as described in Algorithm 1. The annotators were asked to judge whether a candidate word is a new word and to judge the polarity of each new word (positive, negative, or neutral). Disagreements on either task were resolved through discussion. The annotation yielded 323 new words, among which there are 116 positive, 112 negative, and 95 neutral words. (All the resources are available upon request.)
As our algorithm outputs a ranked list of words, we adapt average precision to evaluate the performance of new sentiment word detection. The metric is computed as follows:

AP(n) = \frac{\sum_{i=1}^{n} P(i) \times rel(i)}{\sum_{i=1}^{n} rel(i)}

where P(i) is the precision at cut-off i, rel(i) is 1 if the word at position i is a new word and 0 otherwise, and n is the number of words in the ranked list. A perfect list (all top items are correct) has an AP value of 1.0.
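A short Python implementation of the average precision as reconstructed above (precision at each relevant position, normalized by the number of relevant items):

```python
def average_precision(is_new_word):
    """is_new_word: list of 0/1 labels for the ranked list (1 = new word)."""
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(is_new_word, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i     # precision at cut-off i
    return precision_sum / hits if hits else 0.0

print(average_precision([1, 1, 1]))   # a perfect list gets 1.0
print(average_precision([1, 0, 1]))   # (1/1 + 2/3) / 2 ≈ 0.833
```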
First, we assess the influence of likelihood ratio test, which measures the association of a word to the pattern set. As can be seen from Table 4, the association model (LRT) remarkably boosts the performance of new word detection, indicating LRT is a key factor for new sentiment word extraction. From linguistic perspectives, new sentiment words are commonly modified by adverbial words and thus should have close association with lexical patterns.
Table 4: Average precision at different numbers of returned words (K), with and without the LRT association measure.

| | K=100 | K=200 | K=300 | K=400 | K=500 |
|---|---|---|---|---|---|
| LPE | 0.366 | 0.324 | 0.286 | 0.270 | 0.259 |
| LRT+LPE | 0.743 | 0.652 | 0.613 | 0.582 | 0.548 |
| LPE+NWP | 0.467 | 0.400 | 0.350 | 0.330 | 0.320 |
| LRT+LPE+NWP | 0.755 | 0.680 | 0.612 | 0.571 | 0.543 |
| LPE+EMI | 0.608 | 0.551 | 0.519 | 0.486 | 0.467 |
| LRT+LPE+EMI | 0.859 | 0.759 | 0.717 | 0.662 | 0.632 |
| LPE+NMED | 0.749 | 0.690 | 0.641 | 0.612 | 0.576 |
| LRT+LPE+NMED | 0.907 | 0.808 | 0.741 | 0.723 | 0.699 |
Second, we compare different settings of our method with two baselines: enhanced mutual information (EMI) [22] and the normalized multi-word expression distance (NMED) [2]. The results are shown in Figure 1. As can be seen, all the proposed measures outperform the two baselines (EMI and NMED) remarkably and consistently. The setting combining LRT, LPE, and NMED produces the best performance. Adding EMI or NMED leads to remarkable improvements because of their capability of measuring the non-compositionality of new words. Using the association and entropy measures without EMI or NMED obtains fairly good results when K is small, but the performance drops sharply at larger K because these measures alone are unable to capture non-compositionality. Comparing the settings with and without left pattern entropy shows that including LPE also clearly boosts performance. However, the new word probability (NWP) contributes only marginally to the improvement.
In the above experiments, the number of patterns chosen at each iteration and the number of words added at each iteration are set to their optimal values, which will be discussed in the next subsection, and only one seed word, "坑爹 (reverse one's expectation)", is used.
Firstly, we show how to obtain the optimal settings of the number of patterns chosen at each iteration and the number of words added at each iteration. The scoring setting we take here is the one shown in Formula (12). Again, only one seed word, "坑爹 (reverse one's expectation)", is used, and the number of returned words K is fixed. Results in Table 5 show that the performance drops consistently across different settings as the number of patterns increases. Note that at the early stage of Algorithm 1, choosing more patterns (possibly including noisy ones) may lower the quality of the selected new words, while adding more words (possibly including noisy seed words) may lower the quality of the lexical patterns. Therefore, we set both parameters to small values.
Table 5: Average precision with different numbers of words added (rows) and patterns chosen (columns) at each iteration.

| Words added \ Patterns | 2 | 3 | 4 | 5 | 10 | 20 | 50 |
|---|---|---|---|---|---|---|---|
| 5 | 0.753 | 0.738 | 0.746 | 0.741 | 0.741 | 0.734 | 0.715 |
| 10 | 0.753 | 0.738 | 0.746 | 0.741 | 0.741 | 0.728 | 0.712 |
| 15 | 0.753 | 0.738 | 0.746 | 0.741 | 0.754 | 0.734 | 0.718 |
| 20 | 0.763 | 0.738 | 0.744 | 0.749 | 0.749 | 0.735 | 0.717 |
Secondly, we examine whether the proposed algorithm is sensitive to the number of seed words. We use the optimal parameter settings obtained above and take Formula (12) as the weighting measure of new words. We experimented with one, two, three, and four seed words, respectively. The results in Table 6 show very stable performance across different numbers of seed words. Interestingly, the performance is exactly the same regardless of the number of seed words. By inspecting the pattern set and the words selected at each iteration, we found that the current pattern set converges to the same set after a few iterations, and in the first several iterations the selected words are almost the same, although the order in which they are added differs. Since the algorithm finally sorts the candidate words in its last step with the same pattern set, the resulting ranking is identical.
Table 6: Average precision with different numbers of seed words.

| | 1 seed | 2 seeds | 3 seeds | 4 seeds |
|---|---|---|---|---|
| K=100 | 0.907 | 0.907 | 0.907 | 0.907 |
| K=200 | 0.808 | 0.808 | 0.808 | 0.808 |
| K=300 | 0.741 | 0.741 | 0.741 | 0.741 |
| K=400 | 0.709 | 0.709 | 0.709 | 0.709 |
| K=500 | 0.685 | 0.685 | 0.685 | 0.685 |
Lastly, we need to decide the optimal size of the candidate pattern set C_p in Algorithm 1, because this set is used in computing the left pattern entropy (see Formula (4)). Too small a candidate pattern set may lead to insufficient estimation of the left pattern entropy. Results in Table 7 show that larger candidate sets decrease the performance, particularly when the number of returned words (K) becomes larger. We therefore set the size of the candidate pattern set to 50.
Table 7: Average precision with different sizes of the candidate pattern set C_p.

| | 50 | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|---|
| K=100 | 0.907 | 0.905 | 0.916 | 0.916 | 0.888 | 0.887 |
| K=200 | 0.808 | 0.810 | 0.778 | 0.776 | 0.766 | 0.764 |
| K=300 | 0.741 | 0.731 | 0.722 | 0.726 | 0.712 | 0.713 |
| K=400 | 0.709 | 0.708 | 0.677 | 0.675 | 0.656 | 0.655 |
| K=500 | 0.685 | 0.683 | 0.653 | 0.646 | 0.626 | 0.627 |
In this section, we attempt to classify the polarity of the 323 annotated new words. Two methods with different settings are adapted for this purpose: the first is majority vote (MV), and the second is pointwise mutual information (PMI), similar to [20]. The majority vote method is formulated as follows:
MV(w) = \frac{\sum_{e \in E_p} C(w, e) - \sum_{e \in E_n} C(w, e)}{\sum_{e \in E_p} C(w, e) + \sum_{e \in E_n} C(w, e)}

where E_p and E_n are a positive and a negative set of emoticons (or seed words), respectively, and C(w, e) is the co-occurrence count of the input word w and the item e. The polarity is judged according to this rule: if MV(w) > λ, the word is positive; if MV(w) < -λ, negative; otherwise, neutral. The threshold λ is manually tuned.
The PMI-based score is computed as follows:

SO(w) = \sum_{e \in E_p} PMI(w, e) - \sum_{e \in E_n} PMI(w, e)

where PMI(w, e) = \log \frac{p(w, e)}{p(w)\, p(e)} and p(\cdot) denotes probability. The polarity is judged according to the rule: if SO(w) > λ, w is positive; if SO(w) < -λ, negative; otherwise, neutral. The threshold λ is manually tuned.
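A minimal sketch of the PMI-based (Turney-style) scoring and the thresholding rule; the count dictionaries and names are assumptions for illustration only.

```python
import math

def so_pmi(word, pos_items, neg_items, cooccur, doc_count, n_docs):
    """cooccur[(w, e)] and doc_count[x] are document counts over the corpus."""
    def pmi(w, e):
        joint = cooccur.get((w, e), 0)
        if joint == 0 or doc_count.get(w, 0) == 0 or doc_count.get(e, 0) == 0:
            return 0.0
        return math.log((joint / n_docs)
                        / ((doc_count[w] / n_docs) * (doc_count[e] / n_docs)))
    return (sum(pmi(word, e) for e in pos_items)
            - sum(pmi(word, e) for e in neg_items))

def polarity(score, threshold):
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```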
Table 8: The 10 emoticons (5 positive, 5 negative) used in the Small_Emo setting (emoticon images omitted).
As for the resources E_p and E_n, we have three settings. The first (denoted by Large_Emo) is a set of the 36 most frequent emoticons, containing 21 positive and 15 negative emoticons. The second (denoted by Small_Emo) is a set of 10 emoticons chosen from the 36, as shown in Table 8. The third (denoted by Opin_Words) consists of two sets of seed opinion words: E_p = {高兴 (happy), 大方 (generous), 漂亮 (beautiful), 善良 (kind), 聪明 (smart)} and E_n = {伤心 (sad), 小气 (mean), 难看 (ugly), 邪恶 (wicked), 笨 (stupid)}.
The performance of polarity prediction is shown in Table 9. In two-class polarity classification, we remove neutral words and predict only the positive/negative classes. The first observation is that using emoticons performs much better than using seed opinion words. We conjecture that this is because new sentiment words co-occur more frequently with emoticons than with these opinion words. The second observation is that three-class polarity classification is much more difficult than two-class classification because many extracted new words are nouns, such as "基友 (gay)", "菇凉 (girl)", and "盆友 (friend)", whose sentiment orientation is harder to classify.
Table 9: Performance of polarity prediction.

| Methods | Majority vote | PMI |
|---|---|---|
| Two-class polarity classification | | |
| Large_Emo | 0.861 | 0.865 |
| Small_Emo | 0.846 | 0.851 |
| Opin_Words | 0.697 | 0.654 |
| Three-class polarity classification | | |
| Large_Emo | 0.598 | 0.632 |
| Small_Emo | 0.551 | 0.635 |
| Opin_Words | 0.449 | 0.486 |
In this section, we investigate whether the inclusion of new sentiment words benefits sentiment classification. For this purpose, we randomly sampled and annotated 4,500 Weibo posts that contain at least one opinion word from the union of the Hownet opinion lexicons (http://www.keenage.com/html/c_index.html) and our annotated new words. We apply two models for polarity classification. The first is a lexicon-based model (denoted by Lexicon) that counts the numbers of positive and negative opinion words in a post and classifies the post as positive if there are more positive words than negative ones, and as negative otherwise. The second is an SVM model in which opinion words are used as features; 5-fold cross validation is conducted.
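The lexicon-based counting rule is simple enough to sketch directly; the lexicon contents in the usage example are illustrative, not taken from Hownet.

```python
def lexicon_classify(post_words, pos_lexicon, neg_lexicon):
    """Positive if positive opinion words outnumber negative ones, negative
    otherwise, following the rule described above."""
    n_pos = sum(1 for w in post_words if w in pos_lexicon)
    n_neg = sum(1 for w in post_words if w in neg_lexicon)
    return "positive" if n_pos > n_neg else "negative"

# Example: adding the new word "给力" to the positive lexicon flips the decision.
pos_lex, neg_lex = {"漂亮", "给力"}, {"难看"}
print(lexicon_classify(["表演", "非常", "给力"], pos_lex, neg_lex))  # positive
```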
We experiment with different settings of Hownet lexicon resources:
Hownet opinion words (denoted by Hownet): after removing some obviously inappropriate words, the remaining lexicon has 627 positive and 1,038 negative opinion words.
Compact Hownet opinion words (denoted by cptHownet): we count the frequency of the above opinion words on the training data and remove words whose document frequency is less than 2, which results in 138 positive and 125 negative words.
Then, we add to the above resources the labeled new polar words (denoted by NW, including 116 positive and 112 negative words) and the top 100 words produced by the algorithm (denoted by T100), respectively. Note that the lexicon-based model requires the sentiment orientation of each dictionary entry, so we manually labeled the polarity of all top 100 words (without removing incorrect new words), which yields 52 positive and 34 negative words. (This labeling is not necessary for the SVM model, where all of the top 100 words can be used as features.)
Results in Table 10 show that the inclusion of new words improves the performance of both models remarkably. With the original lexicon (Hownet), both models obtain 2-3% gains from the inclusion of new words, and similar improvements are observed with the compact lexicon. Note that T100 is obtained automatically from Algorithm 1 and may therefore contain words that are not new sentiment words, yet this resource also improves performance remarkably.
Table 10: Sentiment classification performance with different lexicon resources.

| Resource | Pos/Neg | Lexicon | SVM |
|---|---|---|---|
| Hownet | 627/1,038 | 0.737 | 0.756 |
| Hownet+NW | 743/1,150 | 0.770 | 0.779 |
| Hownet+T100 | 679/1,172 | 0.761 | 0.774 |
| cptHownet | 138/125 | 0.738 | 0.758 |
| cptHownet+NW | 254/237 | 0.774 | 0.782 |
| cptHownet+T100 | 190/159 | 0.764 | 0.775 |
To extract new sentiment words from large-scale user-generated content, this paper proposes a fully unsupervised, purely data-driven, and almost knowledge-free (except for POS tags) framework. We design statistical measures to quantify the utility of a lexical pattern and to measure the possibility of a word being a new word, respectively. The method is almost free of linguistic resources and does not rely on elaborate linguistic rules. We conduct extensive experiments to reveal the influence of different statistical measures on new word finding. Comparative experiments show that the proposed method outperforms the baselines remarkably. Experiments also demonstrate that the inclusion of new sentiment words clearly benefits sentiment classification.
From a linguistic perspective, our framework is capable of extracting adjectival new words because the lexical patterns usually modify adjectives. As future work, we are considering how to extract other types of new sentiment words, such as nounal new words that express sentiment.
This work was partly supported by the following grants from: the National Basic Research Program (973 Program) under grant No. 2012CB316301 and 2013CB329403, the National Science Foundation of China project under grant No. 61332007 and No. 60803075, and the Beijing Higher Education Young Elite Teacher Project.