Emotion lexicons play a crucial role in sentiment analysis and opinion mining. In this paper, we propose a novel Emotion-aware LDA (EaLDA) model to build a domain-specific lexicon for six predefined emotions: anger, disgust, fear, joy, sadness, and surprise. The model uses a minimal set of domain-independent seed words as prior knowledge to discover a domain-specific lexicon, learning a fine-grained emotion lexicon that is much richer and better adapted to a specific domain. Through comprehensive experiments, we show that our model can generate a high-quality, fine-grained, domain-specific emotion lexicon.
Min Yang‡, Baolin Peng§, Zheng Chen§, Dingju Zhu†,¶ (corresponding author), Kam-Pui Chow‡
† School of Computer Science, South China Normal University, Guangzhou, China — dingjuzhu@gmail.com
‡ Department of Computer Science, The University of Hong Kong, Hong Kong — {myang, chow}@cs.hku.hk
§ Department of Computer Science, Beihang University, Beijing, China — b.peng@cse.buaa.edu.cn, tzchen86@gmail.com
¶ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Due to the popularity of opinion-rich resources (e.g., online review sites, forums, blogs, and microblogging websites), automatic extraction of opinions, emotions, and sentiments in text is of great significance for obtaining useful information for social and security studies. Various opinion mining applications have been proposed by different researchers, such as question answering, opinion mining, and sentiment summarization. As fine-grained annotated data are expensive to obtain, unsupervised approaches are preferred and more widely used in practice. A high-quality emotion lexicon usually plays a significant role when applying unsupervised approaches to fine-grained emotion classification.
Thus far, most lexicon construction approaches focus on constructing general-purpose emotion lexicons [11, 7, 16, 4]. However, since a specific word can carry various emotions in different domains, a general-purpose emotion lexicon is less accurate and less informative than a domain-specific lexicon [1]. In addition, most lexicons in previous work label words along coarse-grained dimensions (positive, negative, and neutral). Such lexicons cannot accurately reflect the complexity of human emotions and sentiments. Lastly, previous emotion lexicons are mostly annotated based on manually constructed resources (e.g., emotion lexicons, parsers, etc.), which limits the applicability of these methods to a broader range of tasks and languages.
To meet the challenges mentioned above, we propose a novel EaLDA model to construct a domain-specific emotion lexicon covering six primary emotions (i.e., anger, disgust, fear, joy, sadness, and surprise). The proposed EaLDA model extends the standard Latent Dirichlet Allocation (LDA) [3] model by employing a small set of seed words to guide topic generation, so that the resulting topics group semantically related words into the same emotion category. The lexicon is thus able to best meet the user's specific needs. Our approach is weakly supervised, since only a few seed emotion words are needed to launch the lexicon construction process. In practical applications, asking users to provide a few seeds is easy, as they usually have good knowledge of what is important in their domains.
Extensive experiments are carried out to evaluate our model both qualitatively and quantitatively on a benchmark dataset. The results demonstrate that our EaLDA model improves both the quality and the coverage of state-of-the-art fine-grained lexicons.
Emotion lexicons play an important role in opinion mining and sentiment analysis. To build such lexicons, researchers have investigated various approaches, which can roughly be classified into two categories according to the information they use. The first kind of approach is thesaurus-based, utilizing synonyms or glosses to determine the sentiment orientation of a word. The availability of the WordNet [9] database is an important starting point for many thesaurus-based approaches [8, 7, 5]. The second kind of approach is based on the idea that emotion words co-occurring with each other are likely to convey the same polarity. There are numerous studies in this field [14, 15, 5, 2].
Most previous studies of emotion lexicon construction are limited to positive and negative emotions. Recently, to handle the increasing amount of emotional data, a few studies have been conducted to identify the fine-grained emotions of words [12, 6, 10]. For example, Gill et al. (2008) utilize computational linguistic tools to identify the emotions of words (such as joy, sadness, acceptance, disgust, fear, anger, surprise, and anticipation). However, this approach is mainly intended for public use in general domains. Rao et al. (2012) propose a method for automatically building a word-emotion mapping dictionary for social emotion detection. However, the emotion lexicon is not output explicitly in that paper, and the approach is fully unsupervised, which may make it difficult to adjust to a personalized data set.
Our approach relates most closely to the method proposed by Xie and Li (2012) for constructing a polarity-annotated lexicon based on the LDA model. Our approach differs from [17] in two important ways: first, we do not address the task of polarity lexicon construction, but instead focus on building a fine-grained emotion lexicon. Second, we do not assume that every word in the documents is subjective, which is impractical in real-world corpora.
In this section, we rigorously define the emotion-aware LDA model and its learning algorithm. We begin with the model description, then present a Gibbs sampling algorithm to infer the model parameters, and finally describe how to generate an emotion lexicon from the model output.
Like the standard LDA model, EaLDA is a generative model. To prevent conceptual confusion, we use a superscript "(e)" to indicate variables related to emotion topics, and a superscript "(n)" to indicate variables of non-emotion topics. We assume that each document has two classes of topics: $K^{(e)}$ emotion topics (corresponding to different emotions) and $K^{(n)}$ non-emotion topics (corresponding to topics that are not associated with any emotion). Each topic is represented by a multinomial distribution over words. In addition, we assume that the corpus vocabulary consists of $V$ distinct words indexed by $\{1, \dots, V\}$.
For emotion topics, the EaLDA model draws the word distribution $\phi^{(e)}_k$ from a biased Dirichlet prior $\mathrm{Dir}(\beta^{(e)}_k)$. The vector $\beta^{(e)}_k \in \mathbb{R}^V$ is constructed with $\beta^{(e)}_{k,w} := \gamma_0 (1 - \Omega_{k,w}) + \gamma_1 \Omega_{k,w}$, for $k \in \{1, \dots, K^{(e)}\}$. Here $\Omega_{k,w} = 1$ if and only if word $w$ is a seed word for emotion $k$, otherwise $\Omega_{k,w} = 0$. The scalars $\gamma_0$ and $\gamma_1$ are hyperparameters of the model. Intuitively, when $\gamma_1 > \gamma_0$, the biased prior ensures that the seed words are more probably drawn from the associated emotion topic.
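To make the construction of the biased prior concrete, the following minimal sketch assembles $\beta^{(e)}_k$ from a seed dictionary. The seed lists, toy vocabulary, and the values of $\gamma_0$ and $\gamma_1$ below are illustrative placeholders, not the settings used in our experiments.

```python
import numpy as np

# Illustrative, hypothetical seed lists -- the actual seed dictionary
# (8-12 words per emotion) is the one referenced in the Experiments section.
seed_words = {
    "anger": ["angry", "furious", "rage"],
    "fear": ["afraid", "scared", "terror"],
    "joy": ["happy", "delighted", "cheerful"],
}
vocab = ["angry", "furious", "rage", "afraid", "scared", "terror",
         "happy", "delighted", "cheerful", "game", "flu", "prize"]
word_id = {w: i for i, w in enumerate(vocab)}

gamma_0, gamma_1 = 0.1, 1.0  # placeholder values; the model only requires gamma_1 > gamma_0

def biased_dirichlet_prior(emotion):
    """beta^(e)_{k,w} = gamma_0 * (1 - Omega_{k,w}) + gamma_1 * Omega_{k,w}."""
    omega = np.zeros(len(vocab))
    for w in seed_words[emotion]:
        if w in word_id:
            omega[word_id[w]] = 1.0  # Omega_{k,w} = 1 iff w is a seed word for emotion k
    return gamma_0 * (1.0 - omega) + gamma_1 * omega

beta_e = np.vstack([biased_dirichlet_prior(k) for k in seed_words])
print(beta_e.shape)  # (number of emotion topics, V)
```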
The generative process of word distributions for non-emotion topics follows the standard LDA definition with a scalar hyperparameter $\beta^{(n)}$.
For each word in the document, we decide whether its topic is an emotion topic or a non-emotion topic by flipping a coin with head-tail probability $(p^{(e)}, p^{(n)})$, where $p^{(e)} + p^{(n)} = 1$. The emotion (or non-emotion) topic is sampled according to a multinomial distribution $\theta^{(e)}$ (or $\theta^{(n)}$). Here, both $\theta^{(e)}$ and $\theta^{(n)}$ are document-level latent variables. They are generated from Dirichlet priors $\mathrm{Dir}(\alpha^{(e)})$ and $\mathrm{Dir}(\alpha^{(n)})$, with $\alpha^{(e)}$ and $\alpha^{(n)}$ being hyperparameters.
We summarize the generative process of the EaLDA model as below:
for each emotion topic $k \in \{1, \dots, K^{(e)}\}$, draw $\phi^{(e)}_k \sim \mathrm{Dir}(\beta^{(e)}_k)$
for each non-emotion topic $k \in \{1, \dots, K^{(n)}\}$, draw $\phi^{(n)}_k \sim \mathrm{Dir}(\beta^{(n)})$
for each document $d$
  draw $\theta^{(e)} \sim \mathrm{Dir}(\alpha^{(e)})$
  draw $\theta^{(n)} \sim \mathrm{Dir}(\alpha^{(n)})$
  draw $p \sim \mathrm{Dir}(\gamma)$
  for each word in document $d$
    draw topic class indicator $s \sim \mathrm{Bernoulli}(p)$
    if $s$ indicates an emotion topic
      draw $z \sim \mathrm{Mult}(\theta^{(e)})$
      draw $w \sim \mathrm{Mult}(\phi^{(e)}_z)$, emit word $w$
    otherwise
      draw $z \sim \mathrm{Mult}(\theta^{(n)})$
      draw $w \sim \mathrm{Mult}(\phi^{(n)}_z)$, emit word $w$
As an alternative representation, the graphical model of the generative process is shown in Figure 1.
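For readers who prefer code to plate notation, the following minimal sketch simulates the generative story above. The topic counts, hyperparameter values, and toy vocabulary size are illustrative assumptions only, not the experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K_e, K_n = 12, 6, 4                      # toy vocabulary size and topic counts (illustrative)
alpha_e, alpha_n, beta_n = 0.5, 0.5, 0.5    # placeholder symmetric hyperparameters
gamma = np.array([1.0, 1.0])                # prior over the emotion / non-emotion coin
beta_e = np.full((K_e, V), 0.1)             # a biased prior would boost seed-word entries (see above)

phi_e = np.array([rng.dirichlet(beta_e[k]) for k in range(K_e)])            # emotion topic-word dists
phi_n = np.array([rng.dirichlet(np.full(V, beta_n)) for _ in range(K_n)])   # non-emotion topic-word dists

def generate_document(n_words=20):
    theta_e = rng.dirichlet(np.full(K_e, alpha_e))  # document-level emotion topic proportions
    theta_n = rng.dirichlet(np.full(K_n, alpha_n))  # document-level non-emotion topic proportions
    p = rng.dirichlet(gamma)                        # coin (p_e, p_n)
    words = []
    for _ in range(n_words):
        if rng.random() < p[0]:                     # topic class indicator s: emotion topic
            z = rng.choice(K_e, p=theta_e)
            words.append(rng.choice(V, p=phi_e[z]))
        else:                                       # non-emotion topic
            z = rng.choice(K_n, p=theta_n)
            words.append(rng.choice(V, p=phi_n[z]))
    return words

print(generate_document())
```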
Assuming the hyperparameters $\alpha^{(e)}$, $\alpha^{(n)}$, $\gamma$, $\beta^{(e)}$, and $\beta^{(n)}$ are given, we develop a collapsed Gibbs sampling algorithm to estimate the latent variables in the EaLDA model. The algorithm iteratively takes a word from a document and samples the topic that this word belongs to.
Let the whole corpus excluding the current word be denoted by $\mathcal{D}$. Let $n^{(e)}_{k,w}$ (or $n^{(n)}_{k,w}$) indicate the number of occurrences of emotion topic $k$ (or non-emotion topic $k$) with word $w$ in the whole corpus. Let $m^{(e)}_{k}$ (or $m^{(n)}_{k}$) indicate the number of occurrences of emotion topic $k$ (or non-emotion topic $k$) in the current document. All these counts are defined excluding the current word. Using the definition of the EaLDA model and Bayes' rule, we find that the joint density of these random variables is equal to
$$p\big(\phi^{(e)}, \phi^{(n)}, \theta^{(e)}, \theta^{(n)}, p \,\big|\, \mathcal{D}\big) \;\propto\; \prod_{k=1}^{K^{(e)}} \mathrm{Dir}\big(\phi^{(e)}_k \,\big|\, \beta^{(e)}_k + n^{(e)}_k\big) \cdot \prod_{k=1}^{K^{(n)}} \mathrm{Dir}\big(\phi^{(n)}_k \,\big|\, \beta^{(n)} + n^{(n)}_k\big) \cdot \mathrm{Dir}\big(\theta^{(e)} \,\big|\, \alpha^{(e)} + m^{(e)}\big) \cdot \mathrm{Dir}\big(\theta^{(n)} \,\big|\, \alpha^{(n)} + m^{(n)}\big) \cdot \mathrm{Dir}\Big(p \,\Big|\, \gamma + \big(\textstyle\sum_k m^{(e)}_k, \sum_k m^{(n)}_k\big)\Big) \tag{1}$$
According to equation (1), we see that $\phi^{(e)}$, $\phi^{(n)}$, $\theta^{(e)}$, $\theta^{(n)}$, and $p$ are mutually independent sets of random variables. Each of these random variables follows a Dirichlet distribution with a specific set of parameters. By this mutual independence, we decompose the probability of the topic for the current word $w_i$ as
$$p\big(z_i = (e, k) \,\big|\, \mathcal{D}\big) = \mathbb{E}\big[p^{(e)}\big] \cdot \mathbb{E}\big[\theta^{(e)}_k\big] \cdot \mathbb{E}\big[\phi^{(e)}_{k,w_i}\big] \tag{2}$$
$$p\big(z_i = (n, k) \,\big|\, \mathcal{D}\big) = \mathbb{E}\big[p^{(n)}\big] \cdot \mathbb{E}\big[\theta^{(n)}_k\big] \cdot \mathbb{E}\big[\phi^{(n)}_{k,w_i}\big] \tag{3}$$
Then, by examining the properties of the Dirichlet distribution, we can compute the expectations on the right-hand side of equations (2) and (3) as
$$\mathbb{E}\big[p^{(e)}\big] = \frac{\sum_k m^{(e)}_k + \gamma^{(e)}}{\sum_k m^{(e)}_k + \sum_k m^{(n)}_k + \gamma^{(e)} + \gamma^{(n)}} \tag{4}$$
$$\mathbb{E}\big[p^{(n)}\big] = \frac{\sum_k m^{(n)}_k + \gamma^{(n)}}{\sum_k m^{(e)}_k + \sum_k m^{(n)}_k + \gamma^{(e)} + \gamma^{(n)}} \tag{5}$$
$$\mathbb{E}\big[\theta^{(e)}_k\big] = \frac{m^{(e)}_k + \alpha^{(e)}}{\sum_{k'} m^{(e)}_{k'} + K^{(e)} \alpha^{(e)}} \tag{6}$$
$$\mathbb{E}\big[\theta^{(n)}_k\big] = \frac{m^{(n)}_k + \alpha^{(n)}}{\sum_{k'} m^{(n)}_{k'} + K^{(n)} \alpha^{(n)}} \tag{7}$$
$$\mathbb{E}\big[\phi^{(e)}_{k,w_i}\big] = \frac{n^{(e)}_{k,w_i} + \beta^{(e)}_{k,w_i}}{\sum_{w} n^{(e)}_{k,w} + \sum_{w} \beta^{(e)}_{k,w}} \tag{8}$$
$$\mathbb{E}\big[\phi^{(n)}_{k,w_i}\big] = \frac{n^{(n)}_{k,w_i} + \beta^{(n)}}{\sum_{w} n^{(n)}_{k,w} + V \beta^{(n)}} \tag{9}$$
Using the above equations, we can sample the topic for each word iteratively and estimate all latent random variables.
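The per-word sampling step can be sketched as follows, directly mirroring equations (2)–(9). The count arrays and their layout are our own choice of representation; in a full sampler the counts for the current word would be decremented before this call and incremented again for the newly sampled topic.

```python
import numpy as np

def sample_topic(w, counts, priors, rng):
    """Sample the (class, topic) pair for one word occurrence, following Eq. (2)-(9).

    counts: dict with corpus-level counts n_e[K_e, V], n_n[K_n, V] and
            document-level counts m_e[K_e], m_n[K_n], all excluding the current word.
    priors: dict with beta_e[K_e, V], scalar beta_n, scalars alpha_e, alpha_n, and gamma[2].
    """
    n_e, n_n, m_e, m_n = counts["n_e"], counts["n_n"], counts["m_e"], counts["m_n"]
    beta_e, beta_n = priors["beta_e"], priors["beta_n"]
    alpha_e, alpha_n, gamma = priors["alpha_e"], priors["alpha_n"], priors["gamma"]
    K_e, V = n_e.shape
    K_n = n_n.shape[0]

    total = m_e.sum() + m_n.sum() + gamma.sum()
    E_p_e = (m_e.sum() + gamma[0]) / total                                  # Eq. (4)
    E_p_n = (m_n.sum() + gamma[1]) / total                                  # Eq. (5)
    E_theta_e = (m_e + alpha_e) / (m_e.sum() + K_e * alpha_e)               # Eq. (6)
    E_theta_n = (m_n + alpha_n) / (m_n.sum() + K_n * alpha_n)               # Eq. (7)
    E_phi_e = (n_e[:, w] + beta_e[:, w]) / (n_e.sum(1) + beta_e.sum(1))     # Eq. (8)
    E_phi_n = (n_n[:, w] + beta_n) / (n_n.sum(1) + V * beta_n)              # Eq. (9)

    probs = np.concatenate([E_p_e * E_theta_e * E_phi_e,                    # Eq. (2)
                            E_p_n * E_theta_n * E_phi_n])                   # Eq. (3)
    probs /= probs.sum()
    idx = rng.choice(K_e + K_n, p=probs)
    return ("e", idx) if idx < K_e else ("n", idx - K_e)
```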
Our final step is to construct the domain-specific emotion lexicon from the estimates $\hat{\phi}^{(e)}$ and $\hat{\phi}^{(n)}$ that we obtained from the EaLDA model.
For each word $w$ in the vocabulary, we compare the values $\hat{\phi}^{(e)}_{1,w}, \dots, \hat{\phi}^{(e)}_{K^{(e)},w}$ and $\max_k \hat{\phi}^{(n)}_{k,w}$. If $\hat{\phi}^{(e)}_{k,w}$ is the largest, then the word $w$ is added to the emotion dictionary for the $k$-th emotion. Otherwise, $\max_k \hat{\phi}^{(n)}_{k,w}$ is the largest among these values, which suggests that the word is more probably drawn from a non-emotion topic. Thus, the word is considered neutral and not included in the emotion dictionary.
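This selection rule amounts to a simple comparison over the columns of the estimated topic-word matrices, for example:

```python
import numpy as np

def build_lexicon(phi_e, phi_n, vocab, emotions):
    """Assign each word to the emotion topic that gives it the highest probability,
    unless some non-emotion topic gives an even higher one (the word is then neutral).

    phi_e, phi_n: estimated topic-word matrices with one row per topic, one column per word.
    """
    lexicon = {e: [] for e in emotions}
    for w, word in enumerate(vocab):
        k = int(np.argmax(phi_e[:, w]))          # best emotion topic for word w
        if phi_e[k, w] > phi_n[:, w].max():      # beats every non-emotion topic
            lexicon[emotions[k]].append(word)
        # otherwise: word is more probably from a non-emotion topic -> left out (neutral)
    return lexicon
```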
Table 1: Example words for each emotion generated from the SemEval-2007 dataset.

Anger | Disgust | Fear | Joy | Sadness | Surprise
---|---|---|---|---|---
attack | mar | terror | good | kill | surprise |
warn | sex | troop | win | die | first |
gunman | lebanon | flu | prize | kidnap | jump |
baghdad | game | dead | victory | lose | marijuana |
immigration | gaze | die | adopt | confuse | arrest |
hit | cancer | cancer | madonna | crach | sweat |
kidnap | amish | kidnap | celebrity | leave | find |
kill | imigration | force | boost | cancer | attack |
alzheim | sink | iraq | ship | flu | hiv |
iraqi | force | fear | star | kidnap | discover |
Table 2: Emotion classification performance (F1-score) of different methods on the SemEval-2007 dataset.

Algorithm | Anger | Disgust | Fear | Joy | Sadness | Surprise
---|---|---|---|---|---|---
WordNet-Affect | 6.06% | - | - | 22.81% | 17.31% | 9.92% |
SWAT | 7.06% | - | 18.27% | 14.91% | 17.44% | 11.78% |
UA | 16.03% | - | 20.06% | 4.21% | 1.76% | 15.00% |
UPAR7 | 3.02% | - | 4.72% | 11.87% | 17.44% | 15.00% |
EaLDA | 16.65% | 10.52% | 26.21% | 25.57% | 36.85% | 20.17% |
In this section, we report empirical evaluations of our proposed model. Since there is no metric that explicitly measures the quality of an emotion lexicon, we demonstrate the performance of our algorithm in two ways: (1) we perform a case study of the lexicon generated by our algorithm, and (2) we compare the results of solving an emotion classification task using our lexicon against those of other methods, demonstrating the advantage of our lexicon over other lexicons and other emotion classification systems.
We conduct experiments to evaluate the effectiveness of our model on the SemEval-2007 dataset. This is a gold-standard English dataset used in the 14th task of the SemEval-2007 workshop, which focuses on the classification of emotions in text. Each record contains a news headline and scores for the emotions anger, disgust, fear, joy, sadness, and surprise, normalized to the range 0 to 100. Two data sets are available: a training set of 250 records and a test set of 1,000 records. Following the strategy used in [12], the task was carried out in an unsupervised setting for our experiments.
Data preprocessing is performed on the dataset before the experiments. First, the texts are tokenized with the natural language toolkit NLTK (http://www.nltk.org). Then, we remove non-alphabetic characters, numbers, pronouns, punctuation, and stop words from the texts. Finally, the Snowball stemmer (http://snowball.tartarus.org/) is applied to reduce the vocabulary size and alleviate data sparseness.
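A sketch of this preprocessing pipeline is given below. The exact filters are not fully specified above, so the POS-tag-based pronoun removal and the alphabetic-token regex are assumptions on our part.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# One-time downloads: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger")

stemmer = SnowballStemmer("english")
stop = set(stopwords.words("english"))

def preprocess(headline):
    tokens = nltk.word_tokenize(headline.lower())
    tagged = nltk.pos_tag(tokens)
    kept = []
    for tok, tag in tagged:
        if not re.fullmatch(r"[a-z]+", tok):       # drop numbers, punctuation, non-alphabetic tokens
            continue
        if tag in ("PRP", "PRP$") or tok in stop:  # drop pronouns and stop words
            continue
        kept.append(stemmer.stem(tok))             # Snowball stemming reduces the vocabulary size
    return kept

print(preprocess("Mortar assault leaves at least 18 dead"))
```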
We first settle the implementation details for the EaLDA model, specifying the hyperparameters chosen for the experiment. We set the number of emotion topics $K^{(e)}$ to six (one per emotion), and fix the number of non-emotion topics $K^{(n)}$ together with the hyperparameters $\alpha^{(e)}$, $\alpha^{(n)}$, $\beta^{(n)}$, and $\gamma$. The vector $\beta^{(e)}$ is constructed from the seed dictionary using $\gamma_0$ and $\gamma_1$, as described above.
As mentioned, we use a few domain-independent seed words as prior information for our model. Specifically, the seed word list contains 8 to 12 emotional words for each of the six emotion categories (http://minyang.me/acl2014/seed-words.html). However, it is important to note that the proposed model is flexible and does not require seeds for every topic.
Example words for each emotion generated from the SemEval-2007 dataset are reported in Table 1. The judgment is to some extent subjective; the words reported here are those we judged appropriate for each emotion topic. From Table 1, we observe that the generated words are informative and coherent. For example, the words "flu" and "cancer" appear neutral on the surface, but actually express the fear emotion in the SemEval dataset. Such domain-specific words are mostly absent from existing general-purpose emotion lexicons. The experimental results show that our algorithm can successfully construct a fine-grained domain-specific emotion lexicon for this corpus, capturing connotations of words that may not be obvious without context.
We compare the performance of a popular emotion lexicon, WordNet-Affect [13], and our approach on the emotion classification task. We also compare our results with those obtained by three systems participating in the SemEval-2007 emotion annotation task: SWAT, UPAR7, and UA. The emotion classification results are evaluated for each emotion category separately, treating each category as a binary classification problem. When evaluating emotion lexicons, the binary classification is performed in a very simple way: for each emotion category and each text, we compare the number of words within this emotion category against the average number of words within the other emotion categories, and output a binary prediction of 1 or 0. This simple approach is chosen to evaluate the robustness of our emotion lexicon.
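One way to read this decision rule is sketched below. It assumes the text has already been preprocessed into stemmed tokens and that each lexicon entry is a set of (stemmed) words; the strict inequality used for the 1/0 decision is our assumption.

```python
def predict_emotion(tokens, lexicon, emotion):
    """Output 1 if the text uses more words from this emotion's lexicon than it uses,
    on average, from the other emotions' lexicons; otherwise output 0."""
    count = sum(tok in lexicon[emotion] for tok in tokens)
    others = [e for e in lexicon if e != emotion]
    avg_other = sum(sum(tok in lexicon[e] for tok in tokens) for e in others) / len(others)
    return 1 if count > avg_other else 0
```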
In the experiments, performance is evaluated in terms of F1-score. We summarize the results in Table 2. The emotion lexicon generated by the EaLDA model consistently and significantly outperforms the WordNet-Affect emotion lexicon and the three other emotion classification systems. In particular, we obtain an F1-score of 10.52% for the disgust classification task, which previously proposed methods have found difficult. The advantage of our model may come from its capability of exploring domain-specific emotions, which include not only explicit emotion words but also implicit ones.
In this paper, we have presented a novel emotion-aware LDA model that is able to quickly build a fine-grained domain-specific emotion lexicon for languages without many manually constructed resources. The proposed EaLDA model extends the standard LDA model by accepting a set of domain-independent emotion words as prior knowledge and guiding the model to group semantically related words into the same emotion category. The resulting emotion lexicon thus contains much richer, domain-adapted emotion words. Experimental results show that the emotion lexicons generated by our algorithm are of high quality and can assist emotion classification tasks.
In future work, we hope to extend the proposed EaLDA model by exploiting discourse structure knowledge, which has been shown to be significant in identifying the polarity of content-aware words.