ConnotationWordNet:
Learning Connotation over the Word+Sense Network

Jun Seok Kang   Song Feng   Leman Akoglu   Yejin Choi
Department of Computer Science
Stony Brook University
Stony Brook, NY 11794-4400
{junkang, songfeng, leman, ychoi}@cs.stonybrook.edu
Abstract

We introduce ConnotationWordNet, a connotation lexicon over the network of words in conjunction with senses. We formulate the lexicon induction problem as collective inference over pairwise Markov Random Fields, and present a loopy belief propagation algorithm for inference. The key aspect of our method is that it is the first unified approach that assigns the polarity of both word- and sense-level connotations, exploiting the innate bipartite graph structure encoded in WordNet. We present a comprehensive evaluation to demonstrate the quality and utility of the resulting lexicon in comparison to existing connotation and sentiment lexicons.

1 Introduction

We introduce ConnotationWordNet, a connotation lexicon over the network of words in conjunction with senses, as defined in WordNet. A connotation lexicon, introduced first by Feng et al. (2011), aims to encompass the subtle shades of sentiment a word may conjure, even for seemingly objective words such as “sculpture”, “Ph.D.”, or “rosettes”. Understanding the rich and complex layers of connotation remains a challenging task. As a starting point, we study the more feasible task of learning the polarity of connotation.

For non-polysemous words, which constitute a significant portion of the English vocabulary, learning the general connotation at the word level (rather than at the sense level) is a natural operational choice. For polysemous words, however, which include many of the most frequently used words, it would be overly crude to assume that the same connotative polarity applies to all senses of a given word. For example, consider “abound”, for which the lexicographers of WordNet prescribe two different senses:

  • (v) abound: (be abundant or plentiful; exist in large quantities)

  • (v) abound, burst, bristle: (be in a state of movement or action) “The room abounded with screaming children”; “The garden bristled with toddlers”

For the first sense, which is the most commonly used sense of “abound”, the general overtone of the connotation seems positive. That is, although one can use this sense in both positive and negative contexts, it seems to collocate more often with items that are good to have in abundance (e.g., “resources”) than with unfortunate items being abundant (e.g., “complaints”).

As for the second sense, however, for which “burst” and “bristle” can be used interchangeably (a sense in WordNet is defined by its synset, i.e., synonym set: the set of words sharing that sense), the general overtone is slightly more negative, with a touch of unpleasantness, or at least not as positive as that of the first sense. In particular, if we look up the WordNet entry for “bristle”, noticeably more negatively connotative words appear in its gloss and examples.

This word-sense issue has been a universal challenge for a range of Natural Language Processing applications, including sentiment analysis. Recent studies have shown that it is fruitful to tease apart the subjectivity and objectivity corresponding to different senses of the same word in order to improve computational approaches to sentiment analysis (e.g., Pestian et al. (2012), Mihalcea et al. (2012), Balahur et al. (2014)). Encouraged by these successes, in this study we investigate whether we can attain similar gains by modeling the connotative polarity of senses separately.

There is one practical issue we would like to point out in building a sense-level lexical resource, however. End users of such a lexicon may not wish to deal with Word Sense Disambiguation (WSD), which is often too noisy to be incorporated into the pipelines of other NLP tasks. As a result, researchers often need to aggregate labels across different senses to derive a word-level label. Although such aggregation is not entirely unreasonable, it does not seem to be the optimal or most principled way of integrating available resources.

Therefore, in this work, we present the first unified approach that learns both sense- and word-level connotations simultaneously. This way, end users have access to more accurate sense-level connotation labels when needed, while also having access to more general word-level connotation labels. We formulate the lexicon induction problem as collective inference over pairwise Markov Random Fields (pairwise-MRFs) and derive a loopy belief propagation algorithm for inference.

The key aspect of our approach is that we exploit the innate bipartite graph structure between words and senses encoded in WordNet. Although this is conceptually natural, previous approaches, to the best of our knowledge, have not directly exploited these relations between words and senses for the purpose of deriving lexical knowledge over words and senses collectively. In addition, previous studies (of both sentiment and connotation lexicons) aimed to produce only one of the two levels of polarity, word-level or sense-level, while we address both.

Another contribution of our work is the introduction of loopy belief propagation (loopy-BP) as a lexicon induction algorithm. Loopy-BP in our study achieves statistically significantly better performance than the constraint optimization approaches previously explored. In addition, it runs much faster and is considerably easier to implement. Last but not least, by using the probabilistic representation of pairwise-MRFs in conjunction with loopy-BP for inference, the resulting solution has a natural interpretation as the intensity of connotation. This contrasts with approaches that seek discrete solutions, such as Integer Linear Programming [21].

ConnotationWordNet, the final outcome of our study, is a new lexical resource with connotation labels over both words and senses, following the structure of WordNet. The lexicon is publicly available at: http://www.cs.sunysb.edu/~junkang/connotation_wordnet

In what follows, we first describe the network of words and senses (Section 2), then introduce the representation of the network structure as pairwise Markov Random Fields and a loopy belief propagation algorithm for collective inference (Section 3). We then present a comprehensive evaluation (Sections 4-6), followed by related work (Section 7) and conclusions (Section 8).

2 Network of Words and Senses

Figure 1: GWord+Sense with words and senses.

The connotation graph, called GWord+Sense, is a heterogeneous graph with multiple types of nodes and edges. As shown in Figure 1, it contains two types of nodes, (i) lemmas (i.e., words; 115K) and (ii) synsets (63K), and four types of edges: (t1) predicate-argument (179K), (t2) argument-argument (144K), (t3) argument-synset (126K), and (t4) synset-synset (3.4K).

The predicate-argument edges, first introduced by Feng et al. (2011), depict the selectional preference of connotative predicates (i.e., the polarity of a predicate indicates the polarity of its arguments) and encode their co-occurrence relations based on the Google Web 1T corpus. The argument-argument edges are based on the distributional similarities among the arguments. The argument-synset edges capture the synonymy between argument nodes through the corresponding synsets. Finally, the synset-synset edges depict the antonym relations between synset pairs.
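As a concrete illustration of this structure, the sketch below (ours, not the authors' code) assembles a toy fragment of GWord+Sense with networkx; the specific nodes and edges are hypothetical, and the `kind`/`etype` attribute names are our own convention.

```python
# A minimal sketch of the heterogeneous connotation graph G_{Word+Sense}.
import networkx as nx

G = nx.Graph()

# Two node types: lemma (word) nodes and synset nodes.
G.add_node("enjoy", kind="word")
G.add_node("resource", kind="word")
G.add_node("complaint", kind="word")
G.add_node("abound.v.01", kind="synset")   # hypothetical synset id

# Four edge types, tagged so that inference can pick the right
# compatibility matrix per edge (see Section 3).
G.add_edge("enjoy", "resource", etype="pred-arg")       # t1: selectional preference
G.add_edge("resource", "complaint", etype="arg-arg")    # t2: distributional similarity
G.add_edge("resource", "abound.v.01", etype="syn-arg")  # t3: word-sense membership
# t4 (syn-syn, antonymy) edges would connect two synset nodes.
```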

In general, our graph construction is similar to that of Feng et al. (2013), but there are a few important differences. Most notably, we model both words and synsets explicitly, and exploit the membership relations between words and senses. We expect that edges between words and senses will encourage senses that belong to the same word to receive the same connotation label. Conversely, we expect that these edges will also encourage words that belong to the same sense (i.e., synset definition) to receive the same connotation label.

Another benefit of our approach is that for various WordNet relations (e.g., antonym relations), which are defined over synsets (not over words), we can add edges directly between corresponding synsets, rather than projecting (i.e., approximating) those relations over words. Note that the latter, which has been employed by several previous studies (e.g., Kamps et al. (2004), Takamura et al. (2005), Andreevskaia and Bergler (2006), Su and Markert (2009), Lu et al. (2011), Kaji and Kitsuregawa (2007), Feng et al. (2013)), could be a source of noise, as one needs to assume that the semantic relation between a pair of synsets transfers over the pair of words corresponding to that pair of synsets. For polysemous words, this assumption may be overly strong.

3 Pairwise Markov Random Fields and Loopy Belief Propagation

We formulate the task of learning a sense- and word-level connotation lexicon as a graph-based classification task [26]. More formally, we denote the connotation graph GWord+Sense by $G=(V,E)$, in which a total of $n$ word and synset nodes $V=\{v_1,\ldots,v_n\}$ are connected with typed edges $e(v_i,v_j,t)\in E$, where the edge type $t\in\{\text{pred-arg},\text{arg-arg},\text{syn-arg},\text{syn-syn}\}$ corresponds to the four edge types described in Section 2. A neighborhood function $\mathcal{N}$, with $\mathcal{N}_v=\{u \mid e(u,v)\in E\}\subseteq V$, describes the underlying network structure.

In our collective classification formulation, each node in $V$ is represented as a random variable that takes a value from an appropriate class-label domain; in our case, $\mathcal{L}=\{+,-\}$ for positive and negative connotation. We denote by $\mathcal{Y}$ the set of nodes whose labels need to be assigned, and let $y_i$ refer to the label of $Y_i$.

3.1 Pairwise Markov Random Fields

We next define our objective function. We propose an objective formulation based on pairwise Markov Random Fields (MRFs) [14], which we adapt to our problem setting. MRFs are a class of probabilistic graphical models suited to solving inference problems over networked data. An MRF consists of an undirected graph in which each node can be in any of a finite number of states (i.e., class labels). The state of a node is assumed to depend on each of its neighbors and to be independent of the other nodes in the graph. (This assumption yields a pairwise Markov Random Field, a special case of general MRFs [37].) In pairwise MRFs, the joint probability of the graph can be written as a product of pairwise factors, parameterized over the edges. These factors are referred to as clique potentials in general MRFs; they are essentially functions that collectively determine the graph's joint probability.

Specifically, let $G=(V,E)$ denote a network of random variables, where $V$ consists of the unobserved variables $\mathcal{Y}$ that need to be assigned values from the label set $\mathcal{L}$. Let $\Psi$ denote a set of clique potentials consisting of two types of factors:

  • For each $Y_i\in\mathcal{Y}$, $\psi_i\in\Psi$ is a prior mapping $\psi_i:\mathcal{L}\to\mathbb{R}_{\geq 0}$, where $\mathbb{R}_{\geq 0}$ denotes the non-negative real numbers.

  • For each $e(Y_i,Y_j,t)\in E$, $\psi_{ij}^{t}\in\Psi$ is a compatibility mapping $\psi_{ij}^{t}:\mathcal{L}\times\mathcal{L}\to\mathbb{R}_{\geq 0}$.

Objective formulation

Given an assignment 𝐲 to all the unobserved variables 𝒴 and 𝐱 to observed ones 𝒳 (variables with known labels, if any), our objective function is associated with the following joint probability distribution

$$P(\mathbf{y}\mid\mathbf{x})=\frac{1}{Z(\mathbf{x})}\prod_{Y_i\in\mathcal{Y}}\psi_i(y_i)\prod_{e(Y_i,Y_j,t)\in E}\psi_{ij}^{t}(y_i,y_j)\qquad(1)$$

where $Z(\mathbf{x})$ is the normalization function. Our goal is then to infer the maximum-likelihood assignment of states (i.e., labels) to the unobserved variables (i.e., nodes) that maximizes Equation (1).
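To make Equation (1) concrete, consider (as a toy example of ours, not one from the paper) a graph with just two unobserved nodes $Y_1, Y_2$ joined by a single syn-syn (antonym) edge, with uniform priors $\psi_1(\pm)=\psi_2(\pm)=1/2$. Using the syn-syn compatibilities of Table 1,

$$P(y_1,y_2)=\frac{1}{Z}\,\psi_1(y_1)\,\psi_2(y_2)\,\psi_{12}^{\text{syn-syn}}(y_1,y_2)\;\propto\;\begin{cases}\epsilon & \text{if } y_1=y_2,\\ 1-\epsilon & \text{if } y_1\neq y_2,\end{cases}$$

so the maximum-likelihood assignments give the two antonymous synsets opposite polarities, as intended.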

Problem Definition

Having introduced our graph-based classification task and objective formulation, we define our problem more formally.

Given:

  • a connotation graph $G=(V,E)$ of words and synsets connected with typed edges,

  • prior knowledge (i.e., probabilities) of (some or all) nodes belonging to each class,

  • the compatibility of two nodes with a given pair of labels being connected to each other;

classify the nodes $Y_i\in\mathcal{Y}$ into one of two classes, $\mathcal{L}=\{+,-\}$, such that the class assignments $y_i$ maximize the objective in Equation (1).

We can further rank the network objects by the probability of their connotation polarity.

3.2 Loopy Belief Propagation

Finding the best assignment to the unobserved variables in our objective function is the inference problem. The brute-force approach of enumerating all possible assignments is exponential and thus intractable. In general, exact inference is known to be NP-hard, and there is no known algorithm that solves the inference problem for general MRFs. Therefore, in this work we employ a computationally tractable (in fact, linearly scalable with network size) approximate inference algorithm called Loopy Belief Propagation (LBP) [37], which we extend to handle typed graphs such as our connotation graph.

Our inference algorithm is based on iterative message passing and the core of it can be concisely expressed as the following two equations:

$$m_{i\to j}(y_j)=\alpha\sum_{y_i\in\mathcal{L}}\Big(\psi_{ij}^{t}(y_i,y_j)\,\psi_i(y_i)\prod_{Y_k\in\mathcal{N}_i\cap\mathcal{Y}\setminus Y_j}m_{k\to i}(y_i)\Big),\quad\forall y_j\in\mathcal{L}\qquad(2)$$

$$b_i(y_i)=\beta\,\psi_i(y_i)\prod_{Y_j\in\mathcal{N}_i\cap\mathcal{Y}}m_{j\to i}(y_i),\quad\forall y_i\in\mathcal{L}\qquad(3)$$

A message $m_{i\to j}$ is sent from node $i$ to node $j$ and captures the belief of $i$ about $j$: the probability distribution over the labels of $j$, i.e., what $i$ “thinks” $j$'s label is, given the current label of $i$ and the type of the edge connecting $i$ and $j$. Beliefs are the marginal probability distributions of the nodes over the labels; for example, $b_i(y_i)$ denotes the belief of node $i$ having label $y_i$. $\alpha$ and $\beta$ are normalization constants, which ensure, respectively, that each message and each set of marginal probabilities sums to 1. At every iteration, each node computes its belief based on the messages received from its neighbors, and uses the compatibility mapping to transform its belief into messages for its neighbors. The key idea is that after enough iterations of message passing between the nodes, these “conversations” are likely to come to a consensus, which determines the marginal probabilities of all the unknown variables.
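For concreteness, Equations (2) and (3) translate directly into code. The sketch below is ours, not the authors' implementation; it assumes labels are indexed 0 for + and 1 for -, messages are stored in a dict keyed by directed node pairs, and psi[(i, j)] holds the 2x2 compatibility matrix for the type of the edge between i and j.

```python
# A minimal numpy sketch of the message update (Eq. 2) and the
# belief readout (Eq. 3); variable names are ours.
import numpy as np

def send_message(i, j, priors, psi, messages, neighbors):
    """Message m_{i->j} over labels {+, -}, per Equation (2)."""
    # Product of messages into i, excluding what j previously told i.
    incoming = np.ones(2)
    for k in neighbors[i]:
        if k != j:
            incoming *= messages[(k, i)]
    # Sum over y_i of psi[y_i, y_j] * prior_i(y_i) * incoming(y_i).
    m = psi[(i, j)].T @ (priors[i] * incoming)
    return m / m.sum()          # alpha: normalize to a distribution

def belief(i, priors, messages, neighbors):
    """Marginal belief b_i over labels, per Equation (3)."""
    b = priors[i].copy()
    for k in neighbors[i]:
        b *= messages[(k, i)]
    return b / b.sum()          # beta: normalize
```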

Algorithm 1: Connotation Inference
Input: connotation graph G = (V, E); prior potentials ψ_s for seed words s ∈ S; compatibility potentials ψ_ij^t
Output: connotation label probabilities for each node i ∈ V

 1  foreach e(Y_i, Y_j, t) ∈ E do              // initialize messages
 2      foreach y_j ∈ L do
 3          m_{i→j}(y_j) ← 1
 4  foreach i ∈ V do                            // initialize priors
 5      foreach y_j ∈ L do
 6          if i ∈ S then
 7              φ_i(y_j) ← ψ_i(y_j)
 8          else
 9              φ_i(y_j) ← 1/|L|
10  repeat                                      // iterative message passing
11      foreach e(Y_i, Y_j, t) ∈ E with Y_j ∈ Y = V \ S do
12          foreach y_j ∈ L do
13              update m_{i→j}(y_j) via Equation (2)
14  until all messages stop changing
15  foreach Y_i ∈ Y = V \ S do                  // compute final beliefs
16      foreach y_i ∈ L do
17          compute b_i(y_i) via Equation (3)

The pseudo-code of our method is given in Algorithm 1. It first initializes all messages to 1, and the priors to unbiased (i.e., equal) probabilities for all nodes except the seed nodes whose sentiment is known (lines 1-9). It then proceeds by having each $Y_i\in\mathcal{Y}$ communicate messages with its neighbors in an iterative fashion until the messages stabilize (lines 10-14), i.e., convergence is reached. (Although convergence is not theoretically guaranteed, in practice LBP converges fairly quickly and with accurate results, with beliefs changing by less than a small threshold, e.g., $10^{-6}$, between iterations [20, 16, 2].) At convergence, we calculate the marginal probability of assigning $Y_i$ label $y_i$ by computing the final beliefs $b_i(y_i)$ (lines 15-17). We use these maximum-likelihood probabilities for label assignment: each node $i$ is assigned the label $\arg\max_{y_i} b_i(y_i)$.
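A minimal driver for the overall procedure might look as follows, reusing send_message and belief from the sketch above. It is a simplification of Algorithm 1 (for brevity it updates messages on all edges rather than only those into non-seed nodes, and seeds simply keep their fixed priors); seeds is a hypothetical dict mapping seed words to prior distributions.

```python
# Sketch of the outer loop of the inference procedure (ours, not the
# authors' code). `edges` lists each undirected edge in both directions;
# seeds might look like {"enjoy": np.array([0.9, 0.1])}.
import numpy as np

def run_lbp(nodes, edges, neighbors, psi, seeds, tol=1e-6, max_iter=100):
    priors = {i: seeds.get(i, np.array([0.5, 0.5])) for i in nodes}
    messages = {(i, j): np.ones(2) for (i, j) in edges}

    for _ in range(max_iter):                  # iterative message passing
        delta = 0.0
        for (i, j) in edges:
            new = send_message(i, j, priors, psi, messages, neighbors)
            delta = max(delta, np.abs(new - messages[(i, j)]).max())
            messages[(i, j)] = new
        if delta < tol:                        # messages stabilized
            break

    beliefs = {i: belief(i, priors, messages, neighbors) for i in nodes}
    labels = {i: "+" if b[0] >= b[1] else "-" for i, b in beliefs.items()}
    return beliefs, labels
```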

To completely define our algorithm, we need to instantiate the potentials Ψ, in particular the priors and the compatibilities, which we discuss next.

Priors

The prior beliefs $\psi_i$ of the nodes can be suitably initialized whenever prior knowledge of their connotation is available (e.g., enjoy is positive, suffer is negative); as such, our method can flexibly integrate available side information. When no prior knowledge is available, each node is initialized as equally likely to have any of the possible labels, i.e., $1/|\mathcal{L}|$, as in Algorithm 1 (line 9).

Compatibilities

The compatibility potentials can be thought of as matrices with entries $\psi_{ij}^{t}(y_i,y_j)$ giving the likelihood of a node having label $y_i$, given that it has a neighbor with label $y_j$ to which it is connected through a type-$t$ edge. A key difference of our method from earlier models is that the clique potentials differ across edge types, since the connotation graph is heterogeneous: the compatibility of the class labels of two adjacent nodes depends on the type of the edge connecting them. For example, (+ syn-arg +) is highly compatible, whereas (+ syn-syn +) is unlikely, since syn-arg edges capture synonymy, i.e., word-sense membership, while syn-syn edges depict antonym relations.

A sample instantiation of the compatibilities is shown in Table 1 (a code sketch follows the table). Notice that the potentials for pred-arg, arg-arg, and syn-arg capture homophily: nodes with the same label are likely to connect to each other through these types of edges. (Since arg-arg edges are based on co-occurrence (see Section 2), which is a weaker indicator of shared connotation than, e.g., synonymy, we enforce less homophily for nodes connected through arg-arg edges.) On the other hand, syn-syn edges connect nodes that are antonyms of each other, and thus their compatibilities capture the reverse relationship among the labels.

Table 1: Instantiation of compatibility potentials. Entry ψ_ij^t(y_i, y_j) is the compatibility of a node with label y_i having a neighbor labeled y_j, given that the edge between i and j is of type t, for small ε.

(t1) pred-arg                     (t2) arg-arg
  P \ A      +       -              A \ A      +        -
    +      1-ε      ε                 +      1-2ε     2ε
    -       ε      1-ε                -       2ε     1-2ε

(t3) syn-arg                      (t4) syn-syn
(synonym relations)               (antonym relations)
  S \ A      +       -              S \ S      +        -
    +      1-ε      ε                 +       ε       1-ε
    -       ε      1-ε                -      1-ε       ε
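In code, Table 1 amounts to one 2x2 matrix per edge type. The instantiation below is a direct transcription (ours), with index 0 = + and 1 = -, parameterized by the user-set ε of Section 4.3.

```python
# Table 1 as numpy arrays, keyed by edge type; entry [y_i, y_j].
import numpy as np

def compatibilities(eps=0.1):
    homophily = np.array([[1 - eps, eps],
                          [eps, 1 - eps]])
    return {
        "pred-arg": homophily,                     # t1: selectional preference
        "arg-arg":  np.array([[1 - 2*eps, 2*eps],  # t2: weaker homophily
                              [2*eps, 1 - 2*eps]]),
        "syn-arg":  homophily,                     # t3: synonymy (word-sense)
        "syn-syn":  np.array([[eps, 1 - eps],      # t4: antonymy flips labels
                              [1 - eps, eps]]),
    }
```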

Complexity analysis

The most demanding component of Algorithm 1 is the iterative message passing over the edges (lines 10-14), with time complexity $O(m\ell^2 r)$, where $m=|E|$ is the number of edges in the connotation graph, $\ell=|\mathcal{L}|$ is the number of classes, and $r$ is the number of iterations until convergence. In practice $\ell$ is quite small (in our case, $\ell=2$) and $r\ll m$, so the running time grows linearly with the number of edges and the method is scalable to large datasets.

4 Evaluation I: Agreement with Sentiment Lexicons

ConnotationWordNet is expected to be a superset of a sentiment lexicon, since any word with positive or negative sentiment is highly likely to carry connotation of the same polarity. Thus, we use two conventional sentiment lexicons, General Inquirer (GenInq) [27] and MPQA [36], as surrogates to measure the performance of our inference algorithm.

4.1 Variants of Graph Construction

The construction of the connotation graph GWord+Sense, which includes both words and synsets, was described in Section 2. In addition to this graph, we tried several other graph constructions, the first three of which were previously used in [10]. We briefly describe these graphs below, and compare performance on all of them in what follows.

GWord w/ Pred-Arg:

This is a (bipartite) subgraph of GWord+Sense that includes only the connotative predicates and their arguments; as such, it contains only type t1 edges. The edges between the predicates and the arguments can be weighted by their Pointwise Mutual Information (PMI) based on the Google Web 1T corpus. (PMI scores are widely used in previous studies to measure association between words, e.g., [7], [31], [19].)
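For reference, a PMI weight of this kind is conventionally computed from co-occurrence counts as sketched below; this is the standard formula, not the authors' code, and in the paper the counts would come from the Google Web 1T corpus.

```python
# Standard PMI from raw counts: PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ].
import math

def pmi(count_xy, count_x, count_y, total):
    p_xy = count_xy / total          # joint probability of the pair
    p_x = count_x / total            # marginal probability of x
    p_y = count_y / total            # marginal probability of y
    return math.log(p_xy / (p_x * p_y))
```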

GWord w/ Overlay:

The second graph is also a proper subgraph of GWord+Sense; it includes the predicates and all the argument words. Predicate words are connected to their arguments as before. In addition, argument pairs (a1, a2) are connected if they occur together in the coordination “a1 and a2” or “a2 and a1” [11, 24]. This graph contains edges of both types t1 and t2. The edges can also be weighted by the distributional similarities of the word pairs.

GWord:

The third graph is a super-graph of GWord w/ Overlay, with additional edges, where argument pairs in synonym and antonym relation are connected to each other. Note that unlike the connotation graph GWord+Sense, it does not contain any synset nodes. Rather, the words that are synonyms or antonyms of each other are directly linked in the graph. As such, this graph contains all edge types t1 through t4.

GWord+Sense w/ SynSim:

This is a super-graph of our original GWord+Sense graph; that is, it has all the predicate, argument, and synset nodes, as well as the four types of edges between them. In addition, we add edges of a fifth type, t5, between synset nodes to capture their similarity. To define similarity, we use the glossary definitions of the synsets and derive three different scores. Each score utilizes the count(s1, s2) of overlapping nouns, verbs, and adjectives/adverbs among the glosses of the two synsets s1 and s2.

GWord+Sense w/ SynSim1: We discard edges with count less than 3. In the weighted version, the counts are normalized between 0 and 1.

GWord+Sense w/ SynSim2: We normalize the counts by gloss length (the average of the two lengths), that is, p = count / avg(len_gloss(s1), len_gloss(s2)), and discard edges with p < 0.5. The weighted version uses the p values as edge weights.

GWord+Sense w/ SynSim3: To further sparsify the graph, we discard edges with p < 0.6. To weight the edges, we use the cosine similarity between the gloss vectors of the synsets, based on the TF-IDF values of the words the glosses contain.
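A sketch of the SynSim2-style score, as we read the description above: it assumes the glosses have already been tokenized and filtered to nouns, verbs, adjectives, and adverbs, and returns None when the edge is discarded.

```python
# Overlap of content words between two glosses, normalized by the
# average gloss length (our reconstruction of the SynSim2 score).
def synsim2(gloss1, gloss2, threshold=0.5):
    overlap = len(set(gloss1) & set(gloss2))        # count(s1, s2)
    p = overlap / ((len(gloss1) + len(gloss2)) / 2)
    return p if p >= threshold else None            # None: edge discarded
```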

Note that the connotation inference algorithm, as given in Algorithm 1, remains exactly the same for all the graphs described above; only the set of parameters used differs. While GWord w/ Pred-Arg and GWord w/ Overlay contain one and two edge types, respectively, and thus use only compatibilities (t1) and (t2), GWord uses all four, as given in Table 1. The GWord+Sense w/ SynSim graphs use an additional compatibility matrix for the synset-similarity edges of type t5, which is the same as the one used for t1; i.e., similar synsets are likely to have the same connotation label. This flexibility is one of the key advantages of our algorithm, as new types of nodes and edges can be added to the graph seamlessly.

4.2 Sentiment-Lexicon based Performance

In this section, we first compare the performance of our connotation graph GWord+Sense to graphs that do not include synset nodes but only words. Then we analyze the performance when the additional synset similarity edges are added. First, we briefly describe our performance measures.

The sentiment lexicons we use as gold standard are small compared to the size (i.e., number of words) of our graphs. Thus, we first find the overlap between each graph and a sentiment lexicon. Note that the overlap size may be smaller than the lexicon size, as some sentiment words may be missing from our graphs. We then calculate the number of correct label assignments. Precision is thus defined as (correct / overlap), and recall as (correct / lexicon size); the F1-score is their harmonic mean and reflects overall accuracy.
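These measures are straightforward to compute; the sketch below (ours) takes predictions and gold as dicts mapping words to '+'/'-' labels, with gold playing the role of the sentiment lexicon.

```python
# Precision, recall, and F1 as defined above.
def prf(predictions, gold):
    overlap = [w for w in gold if w in predictions]
    correct = sum(predictions[w] == gold[w] for w in overlap)
    precision = correct / len(overlap)        # correct / overlap
    recall = correct / len(gold)              # correct / lexicon size
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```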

As shown in Table 2 (top), we first observe that including the synonym and antonym relations in the graph, as in GWord and GWord+Sense, improves the performance markedly (e.g., from 78.1 to 85.7 F1 on GenInq) over the graphs GWord w/ Pred-Arg and GWord w/ Overlay, which do not contain those relation types. Furthermore, we notice that the performance on the GWord+Sense graph is better than that on the word-only graphs. This shows that including the synset nodes explicitly in the graph structure is beneficial. What is more, it gives us a means to obtain connotation labels for the synsets themselves, which we use in the evaluations of the following sections. Finally, we note that the unweighted versions of the graphs provide relatively more robust performance, potentially due to noise in the relative edge weights.

                            GenInq             MPQA
                         P      R      F         F
Variations of GWord
  w/ Pred-Arg          88.0   67.6   76.5      57.3
  w/ Pred-Arg-w        84.9   68.9   76.1      57.8
  w/ Overlay           87.8   70.4   78.1      58.4
  w/ Overlay-w         82.2   67.7   74.2      54.2
  GWord                88.5   83.1   85.7      69.7
  GWord-w              75.5   71.5   73.4      53.2
Variations of GWord+Sense
  GWord+Sense          88.8   84.1   86.4      70.0
  GWord+Sense-w        76.8   73.0   74.9      54.6
  w/ SynSim1           87.2   83.3   85.2      67.9
  w/ SynSim2           83.9   80.8   82.3      65.1
  w/ SynSim3           86.5   83.2   84.8      67.8
  w/ SynSim1-w         88.0   84.3   86.1      69.2
  w/ SynSim2-w         86.4   83.7   85.0      68.5
  w/ SynSim3-w         86.7   83.4   85.0      68.2
Table 2: Connotation inference performance on various graphs. ‘-w’ indicates weighted versions (see §4.1). P: precision, R: recall, F: F1-score (%).

Next, we analyze the performance when the new edges between synsets are introduced, as given in Table 2 (bottom). We observe that connecting the synset nodes by their gloss similarity (at least in the ways we tried) does not yield better performance than our original GWord+Sense graph. Unlike before, the weighted versions of the similarity-based graphs perform better than their unweighted counterparts. This suggests that gloss similarity may be a more robust means of correlating nodes; we leave exploring this direction for predicate-argument and argument-argument relations as future work.

4.3 Parameter Sensitivity

Our belief-propagation-based connotation inference algorithm has one user-specified parameter, ε (see Table 1). To study the sensitivity of its performance to the choice of ε, we reran our experiments for ε ∈ {0.02, 0.04, ..., 0.24} (note that for ε > 0.25 the compatibilities of ψ^t2 in Table 1 are reversed, hence the maximum of 0.24) and report the accuracy results on GWord+Sense in Figure 2 for the two lexicons. The results indicate that performance remains quite stable across a wide range of parameter choices.

Figure 2: Performance is stable across various ε. (a) GenInq evaluation; (b) MPQA evaluation.

5 Evaluation II: Human Evaluation on ConnotationWordNet

In this section, we present the results of a human evaluation conducted on Amazon Mechanical Turk (AMT). We collected two separate sets of labels: one at the word level and another at the sense level. We first describe the labeling process for sense-level connotation. We selected 350 polysemous words and one sense of each, and each Turker was asked to rate the connotative polarity of a given word (or sense) from -5 to 5, with 0 being neutral. Because senses in WordNet can be tricky to understand, care must be taken in designing the task so that Turkers focus only on the intended sense of a word; we therefore provided the part-of-speech tag, the WordNet gloss of the selected sense, and a few examples as given in WordNet. As an incentive, each Turker was rewarded $0.07 per HIT, which consists of 10 words to label. For each word, we asked 5 Turkers for ratings and took the average of the 5 ratings as the connotative intensity score of the word. We labeled a word as negative if its intensity score was below 0, and positive otherwise. For word-level labels we applied a similar procedure.
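The aggregation step is simple; as a sketch (with hypothetical ratings):

```python
# Average five AMT ratings in [-5, 5]; the sign gives the gold label.
def gold_label(ratings):              # e.g., ratings = [2, 1, 0, 3, -1]
    intensity = sum(ratings) / len(ratings)
    label = "-" if intensity < 0 else "+"
    return label, intensity
```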

5.1 Word-Level Evaluation

We first evaluate the word-level assignment of connotation, as shown in Table 3. The agreement between the new lexicon and the human judges varies between 84% and 86.98%. Sentiment lexicons such as SentiWordNet (Baccianella et al. (2010)) and OpinionFinder (Wilson et al. (2005a)) show low agreement with the human judges, which is somewhat expected: the judges in this study were labeling subtle connotation, not more explicit sentiment. OpinionFinder's low agreement was mainly due to its low hit rate (successful look-up rate of 33.43%). Feng2013, the lexicon presented in [10], showed a relatively higher hit rate of 72.13%.

Note that belief propagation was run until 95% or 99% of the nodes had converged in their beliefs. In addition, the seed words with known connotation labels originally consist of 20 positive and 20 negative predicates. We also extended the seed set with the sentiment-lexicon words, and denote those runs with the prefix e- for ‘extended’.

Lexicon                 Word-level   Sense-level
SentiWordNet               27.22        14.29
OpinionFinder              31.95          -
Feng2013                   62.72          -
GWord+Sense(95%)           84.91        83.43
GWord+Sense(99%)           84.91        83.71
e-GWord+Sense(95%)         86.98        86.29
e-GWord+Sense(99%)         86.69        85.71
Table 3: Word-/sense-level evaluation results (% agreement with human judges).

5.2 Sense-Level Evaluation

We also examined the agreement rates at the sense level. Since OpinionFinder and Feng2013 do not provide polarity scores at the sense level, we excluded them from this evaluation. Because sense-level polarity assignment is a harder (more subtle) task, the performance of all lexicons decreased somewhat compared to the word-level evaluation.

5.3 Pair-wise Intensity Ranking

A notable strength of our induction algorithm is that its output can be interpreted as the intensity of the corresponding connotation. But are these values meaningful? We answer this question by formulating a pair-wise ranking task as a binary decision: given a pair of words, which one is more positive (or more negative) than the other? Since we collected the human labels on a numeric scale, we already have this information at hand. However, because different human judges have different notions of scale, subtle differences are more likely to be noisy. We therefore experiment with varying degrees of difference in scale, as shown in Figure 3: threshold values (ranging from 0.5 to 3.0) indicate the minimum difference in human scores required for a pair of words to be included in the test set. As expected, performance improves as we increase the threshold (as pairs become better separated). Within the range [0.5, 1.5] (249 pairs examined), the accuracy reaches 68.27%, which shows that even subtle differences in connotative intensity are relatively well reflected in the new lexicon.
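A sketch of this evaluation, as we reconstruct it: human and lexicon map each word to an intensity score, and only pairs whose human scores differ by at least the threshold are kept (ties in the lexicon scores count against it here).

```python
# Pair-wise ranking accuracy at a given intensity-difference threshold.
def ranking_accuracy(human, lexicon, pairs, threshold=2.0):
    kept = [(a, b) for (a, b) in pairs
            if abs(human[a] - human[b]) >= threshold]
    correct = sum((human[a] > human[b]) == (lexicon[a] > lexicon[b])
                  for (a, b) in kept)
    return correct / len(kept), len(kept)
```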

Figure 3: Trend of accuracy for pair-wise intensity evaluation over threshold

The results of the pair-wise intensity evaluation (threshold = 2.0; 1,208 pairs) are given in Table 4. Although intensity is generally a harder property to measure than the coarser binary categorization of polarity, our connotation lexicons perform surprisingly well, reaching up to 74.83% accuracy. Further inspection of the incorrect cases reveals that SentiWordNet assigns the same polarity score to many pairs of words (23.34%), which seems to be due to its limited range of score patterns. The ratio of such cases is reported as Undecided in Table 4.

Lexicon                 Correct   Undecided
SentiWordNet             33.77      23.34
GWord+Sense(95%)         74.83       0.58
GWord+Sense(99%)         73.01       0.58
e-GWord+Sense(95%)       73.84       1.16
e-GWord+Sense(99%)       74.01       1.16
Table 4: Results of pair-wise intensity evaluation, for intensity difference threshold = 2.0.

6 Evaluation III: Sentiment Analysis using ConnotationWordNet

Finally, to show the utility of the resulting lexicon in a concrete sentiment analysis task, we perform lexicon-based sentiment analysis. We experiment with the SemEval dataset [28], in which news headlines are human-labeled as good or bad news; we expect these labels to correlate with the use of the connotative words we focus on in this paper. The headlines are annotated with scores ranging from -100 to 87. We construct several data sets by applying different thresholds on the scores; for example, with the threshold set to 60, we discard instances whose scores lie between -60 and 60. For comparison, we also test the connotation lexicon from [10] and the combined sentiment lexicon GenInq+MPQA.

Note that there is a difference between how humans judge the orientation and degree of connotation of a word out of context, and how the use of such words in context is perceived as good or bad news. In particular, we conjecture that writers may have a bias toward positive words, which in turn requires calibration on the readers' part [22]. That is, we might need to tone down the level of positiveness in order to correctly measure the actual intended positiveness of the message.

With this in mind, we tune the appropriate calibration on a small amount of training data, using 1 fold of an N-fold cross-validation for tuning and the remaining N-1 folds for testing. We simply learn a mixture coefficient λ to scale the contribution of positive and negative connotation values. (The reported results are based on λ ∈ {20, 40, 60, 80}; a more detailed parameter search does not change the results much.) We tune this parameter for the other lexicons we compare against as well. Note that, due to this parameter learning, we are able to report better performance for the connotation lexicon of [10] than what its authors reported in their paper (rows labeled with * in Table 5).
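As an illustration only (the paper does not spell out the exact scoring formula), one plausible reading of the λ calibration is to down-weight the aggregate positive connotation of a headline before comparing it against the negative side; the function below is hypothetical in that sense.

```python
# Hypothetical lambda-calibrated lexicon scoring (our reading, not
# necessarily the authors' exact formula). lex_pos / lex_neg map words
# to non-negative positive / negative connotation intensities.
def classify_headline(tokens, lex_pos, lex_neg, lam=60):
    pos = sum(lex_pos.get(w, 0.0) for w in tokens)
    neg = sum(lex_neg.get(w, 0.0) for w in tokens)
    score = (lam / 100.0) * pos - neg     # lam in {20, 40, 60, 80}
    return "good news" if score > 0 else "bad news"
```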

Table 5 shows the results for N=15, where the new lexicon consistently outperforms other competitive lexicons. In addition, Figure 4 shows that the performance does not change much based on the size of training data used for parameter tuning (N={5,10,15,20}).

Lexicon                     SemEval Threshold
                         20      40      60      80
Instance Size           955     649     341      86
Feng2013               71.5    77.1    81.6    90.5
GenInq+MPQA            72.8    77.2    80.4    86.7
GWord+Sense(95%)       74.5    79.4    86.5    91.9
GWord+Sense(99%)       74.6    79.4    86.8    91.9
e-GWord+Sense(95%)     72.5    76.8    82.3    87.2
e-GWord+Sense(99%)     72.6    76.9    82.5    87.2
Feng2013*              70.8    74.6    80.8    93.5
GenInq+MPQA*           64.5    69.0    74.0    80.5
Table 5: SemEval evaluation results, for N=15.
Figure 4: Trend of SemEval performance over N, the number of CV folds

7 Related Work

Several previous approaches explored the use of graph propagation for sentiment lexicon induction [32] and connotation lexicon induction [10]. Our work introduces the use of loopy belief propagation over pairwise-MRF as an alternative solution to these tasks. At a high-level, both approaches share the general idea of propagating confidence or belief over the graph connectivity. The key difference, however, is that in our MRF representation, we can explicitly model various types of word-word, sense-sense and word-sense relations as edge potentials. In particular, we can naturally encode relations that encourage the same assignment (e.g., synonym) as well as the opposite assignment (e.g., antonym) of the polarity labels. Note that integration of the latter is not straightforward in the graph propagation framework.

There have been a number of previous studies that aim to construct word-level sentiment lexicons [34, 25] and sense-level sentiment lexicons [8]. But none of these approaches induces polarity labels at both the word level and the sense level. Although we focus on learning the connotative polarity of words and senses in this paper, the same approach would be applicable to constructing a sentiment lexicon as well.

There have been recent studies that address word sense disambiguation issues for sentiment analysis. SentiWordNet [8] was the first lexicon developed with sense-level labels of sentiment polarity. More recently, Akkaya et al. (2009) report a successful empirical result in which WSD helps improve sentiment analysis, while Wiebe and Mihalcea (2006) study the distinction between objectivity and subjectivity in the different senses of a word, and their empirical effects in the context of sentiment analysis. Our work shares the high-level spirit of assessing sense-level polarity, while also deriving word-level polarity.

In recent years, there has been growing research interest in more fine-grained aspects of lexical sentiment beyond positive and negative polarity. For example, Mohammad and Turney (2010) study the emotions words can evoke in people's minds, while Bollen et al. (2011) study various moods, e.g., “tension” and “depression”, beyond the simple dichotomy of positive and negative sentiment. Our work, as well as recent work by Feng et al. (2011) and Feng et al. (2013), shares this spirit by targeting more subtle, nuanced sentiment even in words that would be considered objective in early studies of sentiment analysis.

8 Conclusion

We have introduced a novel formulation of lexicon induction operating over both words and senses, exploiting the innate structure between words and senses as encoded in WordNet. In addition, we have introduced the use of loopy belief propagation over pairwise Markov Random Fields as an effective lexicon induction algorithm. A notable strength of our approach is its expressiveness: various types of prior knowledge and lexical relations can be encoded as node and edge potentials. It also yields a lexicon of better quality, while offering faster run time and ease of implementation. The resulting lexicon, ConnotationWordNet, is the first lexicon with polarity labels over both words and senses, and it is publicly available for research and practical use.

Acknowledgments

This research was supported by the Army Research Office under Contract No. W911NF-14-1-0029, Stony Brook University Office of Vice President for Research, and gifts from Northrop Grumman Aerospace Systems and Google. We thank reviewers for many insightful comments and suggestions.

References

  • [1] C. Akkaya, J. Wiebe and R. Mihalcea (2009) Subjectivity word sense disambiguation. pp. 190–199.
  • [2] L. Akoglu, R. Chandy and C. Faloutsos (2013) Opinion fraud detection in online reviews by network effects.
  • [3] A. Andreevskaia and S. Bergler (2006) Mining WordNet for a fuzzy sentiment: sentiment tag extraction from WordNet glosses. pp. 209–216.
  • [4] S. Baccianella, A. Esuli and F. Sebastiani (2010) SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. Vol. 10, pp. 2200–2204.
  • [5] A. Balahur, R. Mihalcea and A. Montoyo (2014) Computational approaches to subjectivity and sentiment analysis: present and envisaged methods and applications. Computer Speech & Language 28 (1), pp. 1–6.
  • [6] J. Bollen, H. Mao and A. Pepe (2011) Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena.
  • [7] K. W. Church and P. Hanks (1990) Word association norms, mutual information, and lexicography. Computational Linguistics 16 (1), pp. 22–29.
  • [8] A. Esuli and F. Sebastiani (2006) SentiWordNet: a publicly available lexical resource for opinion mining. pp. 417–422.
  • [9] S. Feng, R. Bose and Y. Choi (2011) Learning general connotation of words using graph-based algorithms. pp. 1092–1103.
  • [10] S. Feng, J. S. Kang, P. Kuznetsova and Y. Choi (2013) Connotation lexicon: a dash of sentiment beneath the surface meaning. pp. 1774–1784.
  • [11] V. Hatzivassiloglou and K. McKeown (1997) Predicting the semantic orientation of adjectives. pp. 174–181.
  • [12] N. Kaji and M. Kitsuregawa (2007) Building lexicon for sentiment analysis from massive collection of HTML documents. pp. 1075–1083.
  • [13] J. Kamps, M. Marx, R. J. Mokken and M. de Rijke (2004) Using WordNet to measure semantic orientations of adjectives.
  • [14] R. Kindermann and J. L. Snell (1980) Markov Random Fields and Their Applications.
  • [15] Y. Lu, M. Castellanos, U. Dayal and C. Zhai (2011) Automatic construction of a context-aware sentiment lexicon: an optimization approach. pp. 347–356.
  • [16] M. McGlohon, S. Bay, M. G. Anderle, D. M. Steier and C. Faloutsos (2009) SNARE: a link analytic system for graph labeling and risk detection. pp. 1265–1274.
  • [17] R. Mihalcea, C. Banea and J. Wiebe (2012) Multilingual subjectivity and sentiment analysis. pp. 4–4.
  • [18] S. Mohammad and P. Turney (2010) Emotions evoked by common words and phrases: using Mechanical Turk to create an emotion lexicon. pp. 26–34.
  • [19] D. Newman, S. Karimi and L. Cavedon (2009) External evaluation of topic models. pp. 11–18.
  • [20] S. Pandit, D. H. Chau, S. Wang and C. Faloutsos (2007) NetProbe: a fast and scalable system for fraud detection in online auction networks. pp. 201–210.
  • [21] C. H. Papadimitriou and K. Steiglitz (1998) Combinatorial Optimization: Algorithms and Complexity. Courier Dover Publications.
  • [22] J. W. Pennebaker and L. D. Stone (2003) Words of wisdom: language use over the life span. Journal of Personality and Social Psychology 85 (2), pp. 291.
  • [23] J. P. Pestian, P. Matykiewicz, M. Linn-Gust, B. South, O. Uzuner, J. Wiebe, K. B. Cohen, J. Hurdle and C. Brew (2012) Sentiment analysis of suicide notes: a shared task. Biomedical Informatics Insights 5 (Suppl. 1), pp. 3.
  • [24] M. J. Pickering and H. P. Branigan (1998) The representation of verbs: evidence from syntactic priming in language production. Journal of Memory and Language 39, pp. 633–651.
  • [25] G. Qiu, B. Liu, J. Bu and C. Chen (2009) Expanding domain sentiment lexicon through double propagation. Vol. 9, pp. 1199–1204.
  • [26] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher and T. Eliassi-Rad (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–106.
  • [27] P. J. Stone, D. C. Dunphy, M. S. Smith and D. M. Ogilvie (1966) The General Inquirer: A Computer Approach to Content Analysis. MIT Press, Cambridge, MA.
  • [28] C. Strapparava and R. Mihalcea (2007) SemEval-2007 Task 14: affective text. pp. 70–74.
  • [29] F. Su and K. Markert (2009) Subjectivity recognition on word senses via semi-supervised mincuts. pp. 1–9.
  • [30] H. Takamura, T. Inui and M. Okumura (2005) Extracting semantic orientations of words using spin model. pp. 133–140.
  • [31] P. D. Turney (2001) Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. pp. 491–502.
  • [32] L. Velikovich, S. Blair-Goldensohn, K. Hannan and R. McDonald (2010) The viability of web-derived polarity lexicons.
  • [33] J. Wiebe and R. Mihalcea (2006) Word sense and subjectivity. pp. 1065–1072.
  • [34] J. Wiebe, T. Wilson and C. Cardie (2005) Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39 (2/3), pp. 164–210.
  • [35] T. Wilson, P. Hoffmann, S. Somasundaran, J. Kessler, J. Wiebe, Y. Choi, C. Cardie, E. Riloff and S. Patwardhan (2005) OpinionFinder: a system for subjectivity analysis. pp. 34–35.
  • [36] T. Wilson, J. Wiebe and P. Hoffmann (2005) Recognizing contextual polarity in phrase-level sentiment analysis.
  • [37] J. S. Yedidia, W. T. Freeman and Y. Weiss (2003) Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, pp. 239–269.