Emotion lexicons play a crucial role in sentiment analysis and opinion mining. In this paper, we propose a novel Emotion-aware LDA (EaLDA) model to build a domain-specific lexicon for six predefined emotions: anger, disgust, fear, joy, sadness, and surprise. The model uses a minimal set of domain-independent seed words as prior knowledge to discover a domain-specific lexicon, learning a fine-grained emotion lexicon that is much richer and better adapted to a specific domain. Through comprehensive experiments, we show that our model can generate a high-quality, fine-grained, domain-specific emotion lexicon.
Min Yang‡, Baolin Peng§, Zheng Chen§, Dingju Zhu†,¶ (corresponding author), Kam-Pui Chow‡
† School of Computer Science, South China Normal University, Guangzhou, China — dingjuzhu@gmail.com
‡ Department of Computer Science, The University of Hong Kong, Hong Kong — {myang, chow}@cs.hku.hk
§ Department of Computer Science, Beihang University, Beijing, China — b.peng@cse.buaa.edu.cn, tzchen86@gmail.com
¶ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Due to the popularity of opinion-rich resources (e.g., online review sites, forums, blogs, and microblogging websites), automatic extraction of opinions, emotions, and sentiments in text is of great significance for obtaining useful information for social and security studies. Various opinion mining applications have been proposed by different researchers, such as question answering, opinion mining, and sentiment summarization. As fine-grained annotated data are expensive to obtain, unsupervised approaches are preferred and more widely used in practice. A high-quality emotion lexicon usually plays a significant role when applying unsupervised approaches to fine-grained emotion classification.
Thus far, most lexicon construction approaches focus on constructing general-purpose emotion lexicons [11, 7, 16, 4]. However, since a specific word can carry various emotions in different domains, a general-purpose emotion lexicon is less accurate and less informative than a domain-specific lexicon [1]. In addition, most lexicons in previous work label words along coarse-grained dimensions (positive, negative, and neutral). Such lexicons cannot accurately reflect the complexity of human emotions and sentiments. Lastly, previous emotion lexicons are mostly annotated based on manually constructed resources (e.g., emotion lexicons, parsers, etc.), which limits the applicability of these methods to a broader range of tasks and languages.
To meet the challenges mentioned above, we propose a novel EaLDA model to construct a domain-specific emotion lexicon covering six primary emotions (i.e., anger, disgust, fear, joy, sadness, and surprise). The proposed EaLDA model extends the standard Latent Dirichlet Allocation (LDA) [3] model by employing a small set of seed words to guide topic generation, so that the resulting topics group semantically related words into the same emotion category. The lexicon is thus able to best meet the user's specific needs. Our approach is weakly supervised, since only a few seed emotion words are needed to launch the lexicon construction process. In practical applications, asking users to provide a few seeds is easy, as they usually have good knowledge of what is important in their domains.
Extensive experiments are carried out to evaluate our model both qualitatively and quantitatively on a benchmark dataset. The results demonstrate that our EaLDA model improves both the quality and the coverage of state-of-the-art fine-grained lexicons.
Emotion lexicons play an important role in opinion mining and sentiment analysis. To build such lexicons, researchers have investigated various approaches, which can roughly be classified into two categories according to the information they use. The first kind of approach is thesaurus-based, utilizing synonyms or glosses to determine the sentiment orientation of a word. The availability of the WordNet [9] database is an important starting point for many thesaurus-based approaches [8, 7, 5]. The second kind of approach is based on the idea that emotion words co-occurring with each other are likely to convey the same polarity. There are numerous studies in this field [14, 15, 5, 2].
Most previous studies of emotion lexicon construction are limited to positive and negative emotions. Recently, to handle the increasing amount of emotional data, a few studies have been conducted to identify the fine-grained emotions of words [12, 6, 10]. For example, Gill et al. (2008) utilize computational linguistic tools to identify the emotions of words (such as joy, sadness, acceptance, disgust, fear, anger, surprise, and anticipation). However, this approach is mainly intended for public use in general domains. Rao et al. (2012) propose a method for automatically building a word-emotion mapping dictionary for social emotion detection. However, the emotion lexicon is not output explicitly in that paper, and the approach is fully unsupervised, which may make it difficult to adjust to a personalized data set.
Our approach relates most closely to the method proposed by Xie and Li (2012) for constructing a polarity-annotated lexicon based on the LDA model. Our approach differs from [17] in two important ways: first, we do not address the task of polarity lexicon construction, but instead focus on building a fine-grained emotion lexicon. Second, we do not assume that every word in the documents is subjective, which is impractical in real-world corpora.
In this section, we rigorously define the emotion-aware LDA model and its learning algorithm. We begin with the model description, then present a Gibbs sampling algorithm to infer the model parameters, and finally describe how to generate an emotion lexicon from the model output.
Like the standard LDA model, EaLDA is a generative model. To prevent conceptual confusion, we use a superscript "(e)" to indicate variables related to emotion topics, and a superscript "(n)" to indicate variables of non-emotion topics. We assume that each document has two classes of topics: $K^{(e)}$ emotion topics (corresponding to different emotions) and $K^{(n)}$ non-emotion topics (corresponding to topics that are not associated with any emotion). Each topic is represented by a multinomial distribution over words. In addition, we assume that the corpus vocabulary consists of $V$ distinct words indexed by $\{1, \dots, V\}$.
For emotion topics, the EaLDA model draws the word distribution $\phi^{(e)}_k$ from a biased Dirichlet prior $\mathrm{Dir}(\beta^{(e)}_k)$. The vector $\beta^{(e)}_k \in \mathbb{R}^V$ is constructed with $\beta^{(e)}_{k,w} := \gamma_0 (1 - \Omega_{k,w}) + \gamma_1 \Omega_{k,w}$, for $k \in \{1, \dots, K^{(e)}\}$. Here $\Omega_{k,w} = 1$ if and only if word $w$ is a seed word for emotion $k$, otherwise $\Omega_{k,w} = 0$. The scalars $\gamma_0$ and $\gamma_1$ are hyperparameters of the model. Intuitively, when $\gamma_1 > \gamma_0$, the biased prior ensures that the seed words are more probably drawn from the associated emotion topic.
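To make the construction of the biased prior concrete, the following minimal sketch assembles $\beta^{(e)}_k$ from a seed dictionary. The seed lists, toy vocabulary, and the values of $\gamma_0$ and $\gamma_1$ below are illustrative placeholders, not the settings used in our experiments.

```python
import numpy as np

# Illustrative, hypothetical seed lists -- the actual seed dictionary
# (8-12 words per emotion) is the one referenced in the Experiments section.
seed_words = {
    "anger": ["angry", "furious", "rage"],
    "fear": ["afraid", "scared", "terror"],
    "joy": ["happy", "delighted", "cheerful"],
}
vocab = ["angry", "furious", "rage", "afraid", "scared", "terror",
         "happy", "delighted", "cheerful", "game", "flu", "prize"]
word_id = {w: i for i, w in enumerate(vocab)}

gamma_0, gamma_1 = 0.1, 1.0  # placeholder values; the model only requires gamma_1 > gamma_0

def biased_dirichlet_prior(emotion):
    """beta^(e)_{k,w} = gamma_0 * (1 - Omega_{k,w}) + gamma_1 * Omega_{k,w}."""
    omega = np.zeros(len(vocab))
    for w in seed_words[emotion]:
        if w in word_id:
            omega[word_id[w]] = 1.0  # Omega_{k,w} = 1 iff w is a seed word for emotion k
    return gamma_0 * (1.0 - omega) + gamma_1 * omega

beta_e = np.vstack([biased_dirichlet_prior(k) for k in seed_words])
print(beta_e.shape)  # (number of emotion topics, V)
```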
The generative process of word distributions for non-emotion topics follows the standard LDA definition with a scalar hyperparameter $\beta^{(n)}$.
For each word in the document, we decide whether its topic is an emotion topic or a non-emotion topic by flipping a coin with head-tail probability $(p^{(e)}, p^{(n)})$, where $p^{(e)} + p^{(n)} = 1$. The emotion (or non-emotion) topic is sampled according to a multinomial distribution $\theta^{(e)}$ (or $\theta^{(n)}$). Here, both $\theta^{(e)}$ and $\theta^{(n)}$ are document-level latent variables. They are generated from Dirichlet priors $\mathrm{Dir}(\alpha^{(e)})$ and $\mathrm{Dir}(\alpha^{(n)})$, with $\alpha^{(e)}$ and $\alpha^{(n)}$ being hyperparameters.
We summarize the generative process of the EaLDA model as below:
for each emotion topic $k \in \{1, \dots, K^{(e)}\}$, draw $\phi^{(e)}_k \sim \mathrm{Dir}(\beta^{(e)}_k)$
for each non-emotion topic $k \in \{1, \dots, K^{(n)}\}$, draw $\phi^{(n)}_k \sim \mathrm{Dir}(\beta^{(n)})$
for each document $d$
  draw $\theta^{(e)} \sim \mathrm{Dir}(\alpha^{(e)})$
  draw $\theta^{(n)} \sim \mathrm{Dir}(\alpha^{(n)})$
  draw $p \sim \mathrm{Dir}(\gamma)$
  for each word in document $d$
    draw topic class indicator $s \sim \mathrm{Bernoulli}(p)$
    if $s$ indicates an emotion topic
      draw $z \sim \mathrm{Mult}(\theta^{(e)})$
      draw $w \sim \mathrm{Mult}(\phi^{(e)}_z)$, emit word $w$
    otherwise
      draw $z \sim \mathrm{Mult}(\theta^{(n)})$
      draw $w \sim \mathrm{Mult}(\phi^{(n)}_z)$, emit word $w$
As an alternative representation, the graphical model of the generative process is shown in Figure 1.
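For readers who prefer code to plate notation, the following minimal sketch simulates the generative story above. The topic counts, hyperparameter values, and toy vocabulary size are illustrative assumptions only, not the experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K_e, K_n = 12, 6, 4                      # toy vocabulary size and topic counts (illustrative)
alpha_e, alpha_n, beta_n = 0.5, 0.5, 0.5    # placeholder symmetric hyperparameters
gamma = np.array([1.0, 1.0])                # prior over the emotion / non-emotion coin
beta_e = np.full((K_e, V), 0.1)             # a biased prior would boost seed-word entries (see above)

phi_e = np.array([rng.dirichlet(beta_e[k]) for k in range(K_e)])            # emotion topic-word dists
phi_n = np.array([rng.dirichlet(np.full(V, beta_n)) for _ in range(K_n)])   # non-emotion topic-word dists

def generate_document(n_words=20):
    theta_e = rng.dirichlet(np.full(K_e, alpha_e))  # document-level emotion topic proportions
    theta_n = rng.dirichlet(np.full(K_n, alpha_n))  # document-level non-emotion topic proportions
    p = rng.dirichlet(gamma)                        # coin (p_e, p_n)
    words = []
    for _ in range(n_words):
        if rng.random() < p[0]:                     # topic class indicator s: emotion topic
            z = rng.choice(K_e, p=theta_e)
            words.append(rng.choice(V, p=phi_e[z]))
        else:                                       # non-emotion topic
            z = rng.choice(K_n, p=theta_n)
            words.append(rng.choice(V, p=phi_n[z]))
    return words

print(generate_document())
```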
Assuming the hyperparameters $\alpha^{(e)}$, $\alpha^{(n)}$, $\gamma$, $\beta^{(e)}$, and $\beta^{(n)}$ are given, we develop a collapsed Gibbs sampling algorithm to estimate the latent variables in the EaLDA model. The algorithm iteratively takes a word from a document and samples the topic that this word belongs to.
Let the whole corpus excluding the current word be denoted by $\mathcal{D}$. Let $n^{(e)}_{k,w}$ (or $n^{(n)}_{k,w}$) indicate the number of occurrences of emotion topic $k$ (or non-emotion topic $k$) with word $w$ in the whole corpus. Let $m^{(e)}_{k}$ (or $m^{(n)}_{k}$) indicate the number of occurrences of emotion topic $k$ (or non-emotion topic $k$) in the current document. All these counts are defined excluding the current word. Using the definition of the EaLDA model and Bayes' rule, we find that the joint density of these random variables is equal to
$$p\big(\phi^{(e)}, \phi^{(n)}, \theta^{(e)}, \theta^{(n)}, p \,\big|\, \mathcal{D}\big) \;\propto\; \prod_{k=1}^{K^{(e)}} \mathrm{Dir}\big(\phi^{(e)}_k \,\big|\, \beta^{(e)}_k + n^{(e)}_k\big) \cdot \prod_{k=1}^{K^{(n)}} \mathrm{Dir}\big(\phi^{(n)}_k \,\big|\, \beta^{(n)} + n^{(n)}_k\big) \cdot \mathrm{Dir}\big(\theta^{(e)} \,\big|\, \alpha^{(e)} + m^{(e)}\big) \cdot \mathrm{Dir}\big(\theta^{(n)} \,\big|\, \alpha^{(n)} + m^{(n)}\big) \cdot \mathrm{Dir}\Big(p \,\Big|\, \gamma + \big(\textstyle\sum_k m^{(e)}_k, \sum_k m^{(n)}_k\big)\Big) \tag{1}$$
According to equation (1), we see that $\phi^{(e)}$, $\phi^{(n)}$, $\theta^{(e)}$, $\theta^{(n)}$, and $p$ are mutually independent sets of random variables. Each of these random variables follows a Dirichlet distribution with a specific set of parameters. By this mutual independence, we decompose the probability of the topic for the current word $w_i$ as
$$p\big(z_i = (e, k) \,\big|\, \mathcal{D}\big) = \mathbb{E}\big[p^{(e)}\big] \cdot \mathbb{E}\big[\theta^{(e)}_k\big] \cdot \mathbb{E}\big[\phi^{(e)}_{k,w_i}\big] \tag{2}$$
$$p\big(z_i = (n, k) \,\big|\, \mathcal{D}\big) = \mathbb{E}\big[p^{(n)}\big] \cdot \mathbb{E}\big[\theta^{(n)}_k\big] \cdot \mathbb{E}\big[\phi^{(n)}_{k,w_i}\big] \tag{3}$$
Then, by examining the properties of the Dirichlet distribution, we can compute the expectations on the right-hand side of equations (2) and (3) as
$$\mathbb{E}\big[p^{(e)}\big] = \frac{\sum_k m^{(e)}_k + \gamma^{(e)}}{\sum_k m^{(e)}_k + \sum_k m^{(n)}_k + \gamma^{(e)} + \gamma^{(n)}} \tag{4}$$
$$\mathbb{E}\big[p^{(n)}\big] = \frac{\sum_k m^{(n)}_k + \gamma^{(n)}}{\sum_k m^{(e)}_k + \sum_k m^{(n)}_k + \gamma^{(e)} + \gamma^{(n)}} \tag{5}$$
$$\mathbb{E}\big[\theta^{(e)}_k\big] = \frac{m^{(e)}_k + \alpha^{(e)}}{\sum_{k'} m^{(e)}_{k'} + K^{(e)} \alpha^{(e)}} \tag{6}$$
$$\mathbb{E}\big[\theta^{(n)}_k\big] = \frac{m^{(n)}_k + \alpha^{(n)}}{\sum_{k'} m^{(n)}_{k'} + K^{(n)} \alpha^{(n)}} \tag{7}$$
$$\mathbb{E}\big[\phi^{(e)}_{k,w_i}\big] = \frac{n^{(e)}_{k,w_i} + \beta^{(e)}_{k,w_i}}{\sum_{w} n^{(e)}_{k,w} + \sum_{w} \beta^{(e)}_{k,w}} \tag{8}$$
$$\mathbb{E}\big[\phi^{(n)}_{k,w_i}\big] = \frac{n^{(n)}_{k,w_i} + \beta^{(n)}}{\sum_{w} n^{(n)}_{k,w} + V \beta^{(n)}} \tag{9}$$
Using the above equations, we can sample the topic for each word iteratively and estimate all latent random variables.
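The per-word sampling step can be sketched as follows, directly mirroring equations (2)–(9). The count arrays and their layout are our own choice of representation; in a full sampler the counts for the current word would be decremented before this call and incremented again for the newly sampled topic.

```python
import numpy as np

def sample_topic(w, counts, priors, rng):
    """Sample the (class, topic) pair for one word occurrence, following Eq. (2)-(9).

    counts: dict with corpus-level counts n_e[K_e, V], n_n[K_n, V] and
            document-level counts m_e[K_e], m_n[K_n], all excluding the current word.
    priors: dict with beta_e[K_e, V], scalar beta_n, scalars alpha_e, alpha_n, and gamma[2].
    """
    n_e, n_n, m_e, m_n = counts["n_e"], counts["n_n"], counts["m_e"], counts["m_n"]
    beta_e, beta_n = priors["beta_e"], priors["beta_n"]
    alpha_e, alpha_n, gamma = priors["alpha_e"], priors["alpha_n"], priors["gamma"]
    K_e, V = n_e.shape
    K_n = n_n.shape[0]

    total = m_e.sum() + m_n.sum() + gamma.sum()
    E_p_e = (m_e.sum() + gamma[0]) / total                                  # Eq. (4)
    E_p_n = (m_n.sum() + gamma[1]) / total                                  # Eq. (5)
    E_theta_e = (m_e + alpha_e) / (m_e.sum() + K_e * alpha_e)               # Eq. (6)
    E_theta_n = (m_n + alpha_n) / (m_n.sum() + K_n * alpha_n)               # Eq. (7)
    E_phi_e = (n_e[:, w] + beta_e[:, w]) / (n_e.sum(1) + beta_e.sum(1))     # Eq. (8)
    E_phi_n = (n_n[:, w] + beta_n) / (n_n.sum(1) + V * beta_n)              # Eq. (9)

    probs = np.concatenate([E_p_e * E_theta_e * E_phi_e,                    # Eq. (2)
                            E_p_n * E_theta_n * E_phi_n])                   # Eq. (3)
    probs /= probs.sum()
    idx = rng.choice(K_e + K_n, p=probs)
    return ("e", idx) if idx < K_e else ("n", idx - K_e)
```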
Our final step is to construct the domain-specific emotion lexicon from the estimates $\hat{\phi}^{(e)}$ and $\hat{\phi}^{(n)}$ that we obtained from the EaLDA model.
For each word $w$ in the vocabulary, we compare the values $\hat{\phi}^{(e)}_{1,w}, \dots, \hat{\phi}^{(e)}_{K^{(e)},w}$ and $\max_k \hat{\phi}^{(n)}_{k,w}$. If $\hat{\phi}^{(e)}_{k,w}$ is the largest, then the word $w$ is added to the emotion dictionary for the $k$-th emotion. Otherwise, $\max_k \hat{\phi}^{(n)}_{k,w}$ is the largest among these values, which suggests that the word is more probably drawn from a non-emotion topic. Thus, the word is considered neutral and not included in the emotion dictionary.
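This selection rule amounts to a simple comparison over the columns of the estimated topic-word matrices, for example:

```python
import numpy as np

def build_lexicon(phi_e, phi_n, vocab, emotions):
    """Assign each word to the emotion topic that gives it the highest probability,
    unless some non-emotion topic gives an even higher one (the word is then neutral).

    phi_e, phi_n: estimated topic-word matrices with one row per topic, one column per word.
    """
    lexicon = {e: [] for e in emotions}
    for w, word in enumerate(vocab):
        k = int(np.argmax(phi_e[:, w]))          # best emotion topic for word w
        if phi_e[k, w] > phi_n[:, w].max():      # beats every non-emotion topic
            lexicon[emotions[k]].append(word)
        # otherwise: word is more probably from a non-emotion topic -> left out (neutral)
    return lexicon
```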
Table 1: Example words for each emotion generated from the SemEval-2007 dataset.

Anger | Disgust | Fear | Joy | Sadness | Surprise
---|---|---|---|---|---
attack | mar | terror | good | kill | surprise |
warn | sex | troop | win | die | first |
gunman | lebanon | flu | prize | kidnap | jump |
baghdad | game | dead | victory | lose | marijuana |
immigration | gaze | die | adopt | confuse | arrest |
hit | cancer | cancer | madonna | crach | sweat |
kidnap | amish | kidnap | celebrity | leave | find |
kill | imigration | force | boost | cancer | attack |
alzheim | sink | iraq | ship | flu | hiv |
iraqi | force | fear | star | kidnap | discover |
Table 2: Emotion classification performance (F1-score) of different methods on the SemEval-2007 dataset.

Algorithm | Anger | Disgust | Fear | Joy | Sadness | Surprise
---|---|---|---|---|---|---
WordNet-Affect | 6.06% | - | - | 22.81% | 17.31% | 9.92% |
SWAT | 7.06% | - | 18.27% | 14.91% | 17.44% | 11.78% |
UA | 16.03% | - | 20.06% | 4.21% | 1.76% | 15.00% |
UPAR7 | 3.02% | - | 4.72% | 11.87% | 17.44% | 15.00% |
EaLDA | 16.65% | 10.52% | 26.21% | 25.57% | 36.85% | 20.17% |
In this section, we report empirical evaluations of our proposed model. Since there is no metric that explicitly measures the quality of an emotion lexicon, we demonstrate the performance of our algorithm in two ways: (1) we perform a case study of the lexicon generated by our algorithm, and (2) we compare the results of solving an emotion classification task using our lexicon against those of other methods, demonstrating the advantage of our lexicon over other lexicons and other emotion classification systems.
We conduct experiments to evaluate the effectiveness of our model on the SemEval-2007 dataset. This is a gold-standard English dataset used in the 14th task of the SemEval-2007 workshop, which focuses on the classification of emotions in text. Each record contains a news headline and scores for the emotions anger, disgust, fear, joy, sadness, and surprise, normalized to the range 0 to 100. Two data sets are available: a training set of 250 records and a test set of 1,000 records. Following the strategy used in [12], the task was carried out in an unsupervised setting for our experiments.
Data preprocessing is performed on the dataset before the experiments. First, the texts are tokenized with the natural language toolkit NLTK (http://www.nltk.org). Then, we remove non-alphabetic characters, numbers, pronouns, punctuation, and stop words from the texts. Finally, the Snowball stemmer (http://snowball.tartarus.org/) is applied to reduce the vocabulary size and alleviate data sparseness.
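A sketch of this preprocessing pipeline is given below. The exact filters are not fully specified above, so the POS-tag-based pronoun removal and the alphabetic-token regex are assumptions on our part.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# One-time downloads: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger")

stemmer = SnowballStemmer("english")
stop = set(stopwords.words("english"))

def preprocess(headline):
    tokens = nltk.word_tokenize(headline.lower())
    tagged = nltk.pos_tag(tokens)
    kept = []
    for tok, tag in tagged:
        if not re.fullmatch(r"[a-z]+", tok):       # drop numbers, punctuation, non-alphabetic tokens
            continue
        if tag in ("PRP", "PRP$") or tok in stop:  # drop pronouns and stop words
            continue
        kept.append(stemmer.stem(tok))             # Snowball stemming reduces the vocabulary size
    return kept

print(preprocess("Mortar assault leaves at least 18 dead"))
```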
We first settle the implementation details for the EaLDA model, specifying the hyperparameters chosen for the experiment. We set the number of emotion topics $K^{(e)}$ to six (one per emotion), and fix the number of non-emotion topics $K^{(n)}$ together with the hyperparameters $\alpha^{(e)}$, $\alpha^{(n)}$, $\beta^{(n)}$, and $\gamma$. The vector $\beta^{(e)}$ is constructed from the seed dictionary using $\gamma_0$ and $\gamma_1$, as described above.
As mentioned, we use a few domain-independent seed words as prior information for our model. Specifically, the seed word list contains 8 to 12 emotional words for each of the six emotion categories (http://minyang.me/acl2014/seed-words.html). However, it is important to note that the proposed model is flexible and does not require seeds for every topic.
Example words for each emotion generated from the SemEval-2007 dataset are reported in Table 1. The judgment is to some extent subjective; the words reported here are those we judged appropriate for each emotion topic. From Table 1, we observe that the generated words are informative and coherent. For example, the words "flu" and "cancer" appear neutral on the surface, but actually express the fear emotion in the SemEval dataset. Such domain-specific words are mostly absent from existing general-purpose emotion lexicons. The experimental results show that our algorithm can successfully construct a fine-grained domain-specific emotion lexicon for this corpus, capturing connotations of words that may not be obvious without context.
We compare the performance of a popular emotion lexicon, WordNet-Affect [13], and our approach on the emotion classification task. We also compare our results with those obtained by three systems participating in the SemEval-2007 emotion annotation task: SWAT, UPAR7, and UA. The emotion classification results are evaluated for each emotion category separately, treating each category as a binary classification problem. When evaluating emotion lexicons, the binary classification is performed in a very simple way: for each emotion category and each text, we compare the number of words within this emotion category against the average number of words within the other emotion categories, and output a binary prediction of 1 or 0. This simple approach is chosen to evaluate the robustness of our emotion lexicon.
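One way to read this decision rule is sketched below. It assumes the text has already been preprocessed into stemmed tokens and that each lexicon entry is a set of (stemmed) words; the strict inequality used for the 1/0 decision is our assumption.

```python
def predict_emotion(tokens, lexicon, emotion):
    """Output 1 if the text uses more words from this emotion's lexicon than it uses,
    on average, from the other emotions' lexicons; otherwise output 0."""
    count = sum(tok in lexicon[emotion] for tok in tokens)
    others = [e for e in lexicon if e != emotion]
    avg_other = sum(sum(tok in lexicon[e] for tok in tokens) for e in others) / len(others)
    return 1 if count > avg_other else 0
```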
In the experiments, performance is evaluated in terms of F1-score. We summarize the results in Table 2. The emotion lexicon generated by the EaLDA model consistently and significantly outperforms the WordNet-Affect emotion lexicon and the three other emotion classification systems. In particular, we obtain an F1-score of 10.52% for the disgust classification task, which previously proposed methods have found difficult. The advantage of our model may come from its capability of exploring domain-specific emotions, which include not only explicit emotion words but also implicit ones.
In this paper, we have presented a novel emotion-aware LDA model that is able to quickly build a fine-grained domain-specific emotion lexicon for languages without many manually constructed resources. The proposed EaLDA model extends the standard LDA model by accepting a set of domain-independent emotion words as prior knowledge and guiding the model to group semantically related words into the same emotion category. The resulting emotion lexicon thus contains much richer, domain-adapted emotion words. Experimental results show that the emotion lexicons generated by our algorithm are of high quality and can assist emotion classification tasks.
In future work, we hope to extend the proposed EaLDA model by exploiting discourse structure knowledge, which has been shown to be significant in identifying the polarity of content-aware words.