In this paper, we present a method that learns word embedding for Twitter sentiment classification. Most existing algorithms for learning continuous word representations model only the syntactic context of words and ignore the sentiment of text. This is problematic for sentiment analysis because such methods map words with similar syntactic contexts but opposite sentiment polarities, such as good and bad, to neighboring word vectors. We address this issue by learning sentiment-specific word embedding (SSWE), which encodes sentiment information in the continuous representation of words. Specifically, we develop three neural networks that effectively incorporate the supervision from the sentiment polarity of text (e.g. sentences or tweets) in their loss functions. To obtain large-scale training corpora, we learn the sentiment-specific word embedding from massive distant-supervised tweets collected by positive and negative emoticons. Experiments on applying SSWE to a benchmark Twitter sentiment classification dataset from SemEval 2013 show that (1) the SSWE feature performs comparably with the hand-crafted features of the top-performing system, and (2) the performance is further improved by concatenating SSWE with existing feature sets.
Twitter sentiment classification has attracted increasing research interest in recent years [21, 20]. The objective is to classify the sentiment polarity of a tweet as positive, negative or neutral. The majority of existing approaches follow Pang et al. [33] and employ machine learning algorithms to build classifiers from tweets with manually annotated sentiment polarity. Under this direction, most studies focus on designing effective features to obtain better classification performance. For example, Mohammad et al. (2013) build the top-performing system in the Twitter sentiment classification track of SemEval 2013 [31], using diverse sentiment lexicons and a variety of hand-crafted features.
Feature engineering is important but labor-intensive. It is therefore desirable to discover explanatory factors from the data and make the learning algorithms less dependent on extensive feature engineering [4]. For the task of sentiment classification, an effective feature learning method is to compose the representation of a sentence (or document) from the representations of the words or phrases it contains [40, 47]. Accordingly, it is a crucial step to learn the word representation (or word embedding), which is a dense, low-dimensional and real-valued vector for a word. Although existing word embedding learning algorithms [9, 27] are intuitive choices, they are not effective enough when used directly for sentiment classification. The most serious problem is that traditional methods typically model the syntactic context of words but ignore the sentiment information of text. As a result, words with opposite polarity, such as good and bad, are mapped to neighboring vectors. This is meaningful for some tasks such as POS tagging [49], since the two words have similar usages and grammatical roles, but it becomes a disaster for sentiment analysis, as they have opposite sentiment polarities.
In this paper, we propose learning sentiment-specific word embedding (SSWE) for sentiment analysis. We encode the sentiment information into the continuous representation of words, so that the model is able to separate good and bad to opposite ends of the spectrum. To this end, we extend the existing word embedding learning algorithm [9] and develop three neural networks that effectively incorporate the supervision from the sentiment polarity of text (e.g. sentences or tweets) in their loss functions. We learn the sentiment-specific word embedding from tweets, leveraging massive tweets with emoticons as distant-supervised corpora without any manual annotation. These automatically collected tweets contain noise, so they cannot be used directly as gold training data to build sentiment classifiers, but they are effective enough to provide weakly supervised signals for training the sentiment-specific word embedding.
We apply SSWE as features in a supervised learning framework for Twitter sentiment classification, and evaluate it on the benchmark dataset of SemEval 2013. In the task of predicting the positive/negative polarity of tweets, our method yields 84.89% in macro-F1 by using only SSWE as features, which is comparable to the top-performing system based on hand-crafted features (84.70%). After concatenating the SSWE feature with the existing feature set, we push the state-of-the-art to 86.58% in macro-F1. The quality of SSWE is also directly evaluated by measuring word similarity in the embedding space for sentiment lexicons. In terms of the accuracy of polarity consistency between each sentiment word and its top $T$ closest words, SSWE outperforms existing word embedding learning algorithms.
The major contributions of the work presented in this paper are as follows.
- We develop three neural networks to learn sentiment-specific word embedding (SSWE) from massive distant-supervised tweets without any manual annotation;
- To our knowledge, this is the first work that exploits word embedding for Twitter sentiment classification. We report results showing that the SSWE feature performs comparably with the hand-crafted features of the top-performing system in SemEval 2013;
- We release the sentiment-specific word embedding learned from 10 million tweets, which can be adopted off-the-shelf in other sentiment analysis tasks.
In this section, we present a brief review of the related work from two perspectives: Twitter sentiment classification, and learning continuous representations for sentiment classification.
Twitter sentiment classification, which identifies the sentiment polarity of short, informal tweets, has attracted increasing research interest [21, 20] in recent years. Generally, the methods employed in Twitter sentiment classification follow traditional sentiment classification approaches. The lexicon-based approaches [44, 11, 41, 42] mostly use a dictionary of sentiment words with their associated sentiment polarity, and incorporate negation and intensification to compute the sentiment polarity for each sentence (or document).
The learning-based methods for Twitter sentiment classification follow Pang et al. (2002)'s work, which treats sentiment classification of texts as a special case of text categorization. Many studies on Twitter sentiment classification [32, 10, 1, 22, 48] leverage massive noisy-labeled tweets selected by positive and negative emoticons as the training set and build sentiment classifiers directly, which is called distant supervision [17]. Instead of directly using the distant-supervised data as the training set, Liu et al. [25] adopt the tweets with emoticons to smooth the language model, and Hu et al. [20] incorporate the emotional signals into an unsupervised learning framework for Twitter sentiment classification.
Many existing learning-based methods for Twitter sentiment classification focus on feature engineering. The reason is that the performance of a sentiment classifier is heavily dependent on the choice of feature representation of tweets. The most representative system is introduced by Mohammad et al. [30], the state-of-the-art system (the top-performing system in the SemEval 2013 Twitter sentiment classification track) built from a number of hand-crafted features. Unlike these previous studies, we focus on learning discriminative features automatically from massive distant-supervised tweets.
Pang et al. (2002) pioneer this field by using a bag-of-words representation, representing each word as a one-hot vector. It has the same length as the size of the vocabulary, and only one dimension is 1, with all others being 0. Based on this representation, many feature learning algorithms have been proposed to obtain better classification performance [34, 24, 14]. However, the one-hot word representation cannot sufficiently capture the complex linguistic characteristics of words.
With the revival of interest in deep learning [2], incorporating the continuous representation of a word as features has proven effective in a variety of NLP tasks, such as parsing [35], language modeling [3, 29] and NER [43]. In the field of sentiment analysis, Bespalov et al. [5, 6] initialize the word embedding with Latent Semantic Analysis and further represent each document as a linear weighted combination of ngram vectors for sentiment classification. Yessenalina and Cardie [47] model each word as a matrix and combine words using iterated matrix multiplication. Glorot et al. [16] explore Stacked Denoising Autoencoders for domain adaptation in sentiment classification. Socher et al. propose the Recursive Neural Network (RNN) [38], matrix-vector RNN [37] and Recursive Neural Tensor Network (RNTN) [40] to learn the compositionality of phrases of any length based on the representation of each pair of children recursively. Hermann et al. [18] present Combinatory Categorial Autoencoders to learn the compositionality of sentences, which marries Combinatory Categorial Grammar with the Recursive Autoencoder.
The representation of words heavily relies on the applications or tasks in which it is used [23]. This paper focuses on learning sentiment-specific word embedding, which is tailored for sentiment analysis. Unlike Maas et al. (2011), who follow the probabilistic document model [7] and attach a sentiment predictor function to each word, we develop neural networks and map each ngram to the sentiment polarity of its sentence. Unlike Socher et al. (2011c), who utilize manually labeled texts to learn the meaning of phrases (or sentences) through compositionality, we focus on learning the meaning of words, namely word embedding, from massive distant-supervised tweets. Unlike Labutov and Lipson (2013), who produce task-specific embeddings from an existing word embedding, we learn sentiment-specific word embedding from scratch.
In this section, we present the details of learning sentiment-specific word embedding (SSWE) for Twitter sentiment classification. We propose incorporating the sentiment information of sentences to learn continuous representations for words and phrases. We extend the existing word embedding learning algorithm [9] and develop three neural networks to learn SSWE. In the following sections, we introduce the traditional method before presenting the details of SSWE learning algorithms. We then describe the use of SSWE in a supervised learning framework for Twitter sentiment classification.
Collobert et al. (2011) introduce the C&W model to learn word embedding based on the syntactic contexts of words. Given an ngram "cat chills on a mat", C&W replaces the center word with a random word $w^r$ and derives a corrupted ngram "cat chills $w^r$ a mat". The training objective is that the original ngram should obtain a higher language model score than the corrupted ngram by a margin of 1. The ranking objective function can be optimized by a hinge loss,
$$loss_{cw}(t, t^r) = \max\big(0,\; 1 - f^{cw}(t) + f^{cw}(t^r)\big) \qquad (1)$$

where $t$ is the original ngram, $t^r$ is the corrupted ngram, and $f^{cw}(\cdot)$ is a one-dimensional scalar representing the language model score of the input ngram.
Figure 1(a) illustrates the neural architecture of C&W, which consists of four layers, namely lookup → linear → hTanh → linear (from bottom to top). The original and corrupted ngrams are treated as inputs of the feed-forward neural network, respectively. The output is the language model score of the input, which is calculated as given in Equation 2, where $L$ is the lookup table of word embedding and $w_1$, $b_1$, $w_2$, $b_2$ are the parameters of the linear layers.
$$f^{cw}(t) = w_2\, a + b_2 \qquad (2)$$

$$a = hTanh\big(w_1\, L_t + b_1\big) \qquad (3)$$

$$hTanh(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases} \qquad (4)$$

where $L_t$ is the concatenation of the embeddings of the words in the input ngram.
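To make the model concrete, the following is a minimal numpy sketch of the C&W scoring function and ranking loss under the settings reported later in this paper (window size 3, embedding length 50, hidden length 20). The parameter names (`L`, `w1`, `b1`, `w2`, `b2`) and the toy word ids are our own illustration, not a released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, win, hidden = 10000, 50, 3, 20   # paper's settings

L  = rng.normal(scale=0.01, size=(vocab_size, emb_dim))    # lookup table of word embedding
w1 = rng.normal(scale=0.01, size=(hidden, win * emb_dim))  # linear layer
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.01, size=hidden)                   # scoring layer
b2 = 0.0

def htanh(x):
    # hard tanh activation, Equation 4
    return np.clip(x, -1.0, 1.0)

def f_cw(word_ids):
    """Language model score of an ngram (Equations 2-3)."""
    t = L[word_ids].reshape(-1)    # concatenate the window's word vectors
    a = htanh(w1 @ t + b1)         # hTanh layer
    return w2 @ a + b2             # one-dimensional score

def loss_cw(orig_ids, corrupted_ids):
    """Ranking hinge loss of Equation 1."""
    return max(0.0, 1.0 - f_cw(orig_ids) + f_cw(corrupted_ids))

orig = [3, 17, 256]        # toy ids for an original trigram
corr = [3, 4999, 256]      # center word replaced by a random word
print(loss_cw(orig, corr))
```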
Following the traditional C&W model [9], we incorporate the sentiment information into the neural network to learn sentiment-specific word embedding. We develop three neural networks (SSWE$_h$, SSWE$_r$ and SSWE$_u$) with different strategies to integrate the sentiment information of tweets.
As an unsupervised approach, the C&W model does not explicitly capture the sentiment information of texts. An intuitive solution for integrating the sentiment information is to predict the sentiment distribution of the text based on the input ngram. We do not utilize the entire sentence as input because sentences vary in length. We therefore slide a window of ngrams across a sentence, and predict the sentiment polarity based on each ngram with a shared neural network. In a neural network, the distributed representations of the higher layers are interpreted as features describing the input. Thus, we utilize the continuous vector of the top layer to predict the sentiment distribution of the text.
Assuming there are $K$ labels, we modify the dimension of the top layer in the C&W model to $K$ and add a softmax layer on top of it. The resulting neural network (SSWE$_h$) is given in Figure 1(b). A softmax layer is suitable for this scenario because its outputs are interpreted as conditional probabilities. Unlike C&W, SSWE$_h$ does not generate any corrupted ngram. Let $f^g(t)$ be the gold $K$-dimensional multinomial distribution of input $t$, where $K$ denotes the number of sentiment polarity labels and $\sum_k f^g_k(t) = 1$. For positive/negative classification, the distribution is of the form [1,0] for positive and [0,1] for negative. The cross-entropy error of the softmax layer is:
$$loss_h(t) = -\sum_{k=\{0,1\}} f^g_k(t) \cdot \log\big(f^h_k(t)\big) \qquad (5)$$

where $f^g(t)$ is the gold sentiment distribution and $f^h(t)$ is the predicted sentiment distribution.
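A hedged sketch of the SSWE$_h$ prediction and loss, continuing the numpy snippet above (it reuses `rng`, `L`, `w1`, `b1` and `htanh`); the names `w2h` and `b2h` are our own notation.

```python
K = 2                                       # positive / negative
w2h = rng.normal(scale=0.01, size=(K, hidden))   # top layer, now K-dimensional
b2h = np.zeros(K)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_h(word_ids):
    """Predicted sentiment distribution of an ngram (SSWE_h)."""
    a = htanh(w1 @ L[word_ids].reshape(-1) + b1)
    return softmax(w2h @ a + b2h)

def loss_h(word_ids, gold):
    """Cross-entropy of Equation 5; gold is [1, 0] (positive) or [0, 1]."""
    pred = predict_h(word_ids)
    return -float(np.sum(np.asarray(gold) * np.log(pred + 1e-12)))

print(loss_h([3, 17, 256], [1, 0]))
```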
SSWE$_h$ is trained by predicting the positive ngram as [1,0] and the negative ngram as [0,1]. However, this constraint is too strict. A distribution of [0.7,0.3] can also be interpreted as a positive label because the positive score is larger than the negative score. Similarly, a distribution of [0.2,0.8] indicates negative polarity. Based on this observation, the hard constraints in SSWE$_h$ can be relaxed: if the sentiment polarity of a tweet is positive, the predicted positive score is expected to be larger than the predicted negative score, and the exact reverse if the tweet has negative polarity.
We model the relaxed constraint with a ranking objective function and borrow the bottom four layers from SSWE$_h$, namely lookup → linear → hTanh → linear in Figure 1(b), to build the relaxed neural network (SSWE$_r$). Compared with SSWE$_h$, the softmax layer is removed because SSWE$_r$ does not require a probabilistic interpretation. The hinge loss of SSWE$_r$ is modeled as described below.
$$loss_r(t) = \max\big(0,\; 1 - \delta_s(t) f^r_0(t) + \delta_s(t) f^r_1(t)\big) \qquad (6)$$

where $f^r_0(t)$ is the predicted positive score, $f^r_1(t)$ is the predicted negative score, and $\delta_s(t)$ is an indicator function reflecting the sentiment polarity of the sentence,

$$\delta_s(t) = \begin{cases} 1 & \text{if } f^g(t) = [1, 0] \\ -1 & \text{if } f^g(t) = [0, 1] \end{cases} \qquad (7)$$
Similar to SSWE$_h$, SSWE$_r$ does not generate corrupted ngrams either.
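A sketch of SSWE$_r$ under the same assumptions as the snippets above: the softmax is dropped and the two raw scores are trained with the ranking loss of Equations 6-7; `w2r` and `b2r` are our own notation.

```python
w2r = rng.normal(scale=0.01, size=(2, hidden))
b2r = np.zeros(2)

def predict_r(word_ids):
    """Raw (positive, negative) scores without a probabilistic layer."""
    a = htanh(w1 @ L[word_ids].reshape(-1) + b1)
    return w2r @ a + b2r

def loss_r(word_ids, is_positive):
    """Relaxed hinge loss of Equation 6; delta implements Equation 7."""
    delta = 1.0 if is_positive else -1.0
    f0, f1 = predict_r(word_ids)       # predicted positive / negative score
    return max(0.0, 1.0 - delta * f0 + delta * f1)

print(loss_r([3, 17, 256], is_positive=True))
```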
The C&W model learns word embedding by modeling the syntactic contexts of words while ignoring sentiment information. By contrast, SSWE$_h$ and SSWE$_r$ learn sentiment-specific word embedding by integrating the sentiment polarity of sentences but leaving out the syntactic contexts of words. In this part, we develop a unified model (SSWE$_u$) that captures the sentiment information of sentences as well as the syntactic contexts of words. SSWE$_u$ is illustrated in Figure 1(c).
Given an original (or corrupted) ngram and the sentiment polarity of a sentence as input, SSWE$_u$ predicts a two-dimensional vector for each input ngram. The two scalars ($f^u_0$, $f^u_1$) stand for the language model score and the sentiment score of the input ngram, respectively. The training objectives of SSWE$_u$ are that (1) the original ngram should obtain a higher language model score $f^u_0(t)$ than the corrupted ngram $f^u_0(t^r)$, and (2) the sentiment score of the original ngram $f^u_1(t)$ should be more consistent with the gold polarity annotation of the sentence than that of the corrupted ngram $f^u_1(t^r)$. The loss function of SSWE$_u$ is a linear combination of two hinge losses,
$$loss_u(t, t^r) = \alpha \cdot loss_{cw}(t, t^r) + (1 - \alpha) \cdot loss_{us}(t, t^r) \qquad (8)$$

where $loss_{cw}(t, t^r)$ is the syntactic loss as given in Equation 1 and $loss_{us}(t, t^r)$ is the sentiment loss as described in Equation 9. The hyper-parameter $\alpha$ weighs the two parts.

$$loss_{us}(t, t^r) = \max\big(0,\; 1 - \delta_s(t) f^u_1(t) + \delta_s(t) f^u_1(t^r)\big) \qquad (9)$$
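A sketch of the unified loss, again continuing the numpy snippets above; `alpha`, `w2u` and `b2u` are our own names.

```python
w2u = rng.normal(scale=0.01, size=(2, hidden))
b2u = np.zeros(2)
alpha = 0.5                            # trade-off hyper-parameter of Eq. 8

def predict_u(word_ids):
    """[language model score f0, sentiment score f1] of an ngram."""
    a = htanh(w1 @ L[word_ids].reshape(-1) + b1)
    return w2u @ a + b2u

def loss_u(orig_ids, corrupted_ids, is_positive):
    """Equation 8: alpha * syntactic loss + (1 - alpha) * sentiment loss."""
    delta = 1.0 if is_positive else -1.0
    fo, fc = predict_u(orig_ids), predict_u(corrupted_ids)
    l_syntactic = max(0.0, 1.0 - fo[0] + fc[0])                   # Eq. 1
    l_sentiment = max(0.0, 1.0 - delta * fo[1] + delta * fc[1])   # Eq. 9
    return alpha * l_syntactic + (1.0 - alpha) * l_sentiment

print(loss_u([3, 17, 256], [3, 4999, 256], is_positive=True))
```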
We train sentiment-specific word embedding from massive distant-supervised tweets collected with positive and negative emoticons (we use the emoticons selected by Hu et al. [20]: the positive emoticons are :) : ) :-) :D =) and the negative emoticons are :( : ( :-( ). We crawl tweets from April 1st, 2013 to April 30th, 2013 with the Twitter API. We tokenize each tweet with TwitterNLP [15], remove @user mentions and URLs from each tweet, and filter out tweets that are too short (< 7 words). Finally, we collect 10M tweets: 5M tweets with positive emoticons and 5M tweets with negative emoticons.
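This collection step can be summarized with a small, hedged filter; the tokenization here is simplified to `str.split()`, whereas the paper tokenizes with TwitterNLP and removes @user mentions and URLs first.

```python
POSITIVE = (":)", ": )", ":-)", ":D", "=)")
NEGATIVE = (":(", ": (", ":-(")

def distant_label(tweet):
    """Assign a distant-supervision label from emoticons, or None."""
    if len(tweet.split()) < 7:
        return None                     # too short, discard
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:              # neither or both: ambiguous signal
        return None
    return "positive" if has_pos else "negative"

print(distant_label("so happy about the game tonight with friends :)"))
```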
We train SSWE$_h$, SSWE$_r$ and SSWE$_u$ by taking the derivative of the loss through back-propagation with respect to the whole set of parameters [9], and use AdaGrad [12] to update the parameters. We empirically set the window size to 3, the embedding length to 50, the length of the hidden layer to 20 and the learning rate of AdaGrad to 0.1 for all baselines and our models. We learn embeddings for unigrams, bigrams and trigrams separately with the same neural network and the same parameter settings. The contexts of a unigram (bigram/trigram) are the surrounding unigrams (bigrams/trigrams), respectively.
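For reference, a minimal AdaGrad update with the paper's learning rate of 0.1; `grad` stands for a gradient obtained by back-propagating one of the losses above with respect to a parameter such as the lookup table `L`.

```python
class AdaGrad:
    def __init__(self, shape, lr=0.1, eps=1e-8):
        self.cache = np.zeros(shape)   # accumulated squared gradients
        self.lr, self.eps = lr, eps

    def step(self, param, grad):
        # per-dimension learning rate shrinks as gradients accumulate
        self.cache += grad ** 2
        param -= self.lr * grad / (np.sqrt(self.cache) + self.eps)

opt_L = AdaGrad(L.shape)               # one optimizer per parameter
# opt_L.step(L, grad_L)                # called once per training ngram
```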
We apply sentiment-specific word embedding for Twitter sentiment classification under a supervised learning framework, as in previous work [33]. Instead of hand-crafting features, we incorporate the continuous representation of words and phrases as the features of a tweet. The sentiment classifier is built from tweets with manually annotated sentiment polarity.
We explore max, min and average convolutional layers [9, 36], which have been used as simple and effective methods for compositionality learning in vector-based semantics [28], to obtain the tweet representation. The result is the concatenation of the vectors derived from the different convolutional layers,

$$z(tw) = \big[z_{max}(tw),\; z_{min}(tw),\; z_{avg}(tw)\big]$$
where $z(tw)$ is the representation of tweet $tw$ and $z_c(tw)$ is the result of the convolutional layer $c \in \{max, min, avg\}$. Each convolutional layer employs the embeddings of unigrams, bigrams and trigrams separately and conducts the matrix-vector operation of $c$ on the sequence represented by columns in each lookup table. The output $z_c(tw)$ is the concatenation of the results obtained from the different lookup tables,

$$z_c(tw) = \big[conv_c(W^{uni}_{tw}),\; conv_c(W^{bi}_{tw}),\; conv_c(W^{tri}_{tw})\big]$$

where $conv_c$ is the convolutional function of layer $c$, and $W^{uni}_{tw}$, $W^{bi}_{tw}$ and $W^{tri}_{tw}$ are the concatenated column vectors of the words of the tweet in the lookup tables of the unigram, bigram and trigram embeddings, respectively.
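A sketch of this composition for a single lookup table; the paper applies the same operation to the unigram, bigram and trigram tables and concatenates all results.

```python
def compose(word_ids, table):
    """Tweet vector from max, min and average layers over one table."""
    vecs = table[word_ids]             # (n_tokens, emb_dim) matrix
    return np.concatenate([vecs.max(axis=0),    # conv_max
                           vecs.min(axis=0),    # conv_min
                           vecs.mean(axis=0)])  # conv_average

tweet_vec = compose([3, 17, 256, 42], L)   # 150-d for a 50-d embedding
```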
We conduct experiments to evaluate SSWE by incorporating it into a supervised learning framework for Twitter sentiment classification. We also directly evaluate the effectiveness of the SSWE by measuring the word similarity in the embedding space for sentiment lexicons.
We conduct experiments on the latest Twitter sentiment classification benchmark dataset from SemEval 2013 [31]. The training and development sets were released in full to task participants. However, we were unable to download all of the training and development sets because some tweets were deleted or unavailable due to modified authorization status. The test set was provided directly to the participants. The distribution of our dataset is given in Table 1. We train the sentiment classifier with LibLinear [13] on the training set, tune parameters on the dev set and evaluate on the test set. The evaluation metric is the macro-F1 of the positive and negative categories (we investigate 2-class Twitter sentiment classification (positive/negative) instead of the 3-class setting (positive/negative/neutral) of SemEval 2013).
| | Positive | Negative | Neutral | Total |
|---|---|---|---|---|
| Train | 2,642 | 994 | 3,436 | 7,072 |
| Dev | 408 | 219 | 493 | 1,120 |
| Test | 1,570 | 601 | 1,639 | 3,810 |

Table 1: Distribution of the SemEval 2013 Twitter sentiment classification dataset used in our experiments.
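As an illustration of this supervised step, here is a hedged sketch using scikit-learn's LinearSVC (a wrapper around the LibLinear library used in the paper); the random features and labels are placeholders for SSWE tweet vectors and gold polarities.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 150))      # placeholder SSWE tweet features
y_train = rng.integers(0, 2, 200)          # placeholder polarity labels
X_test  = rng.normal(size=(80, 150))
y_test  = rng.integers(0, 2, 80)

clf = LinearSVC(C=1.0)                     # C would be tuned on the dev set
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```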
We compare our method with the following sentiment classification algorithms:
(1) DistSuper: We use the 10 million tweets selected by positive and negative emoticons as training data, and build a sentiment classifier with LibLinear and ngram features [17].
(2) SVM: ngram features with a Support Vector Machine are a widely used baseline for building sentiment classifiers [33]. LibLinear is used to train the SVM classifier.
(3) NBSVM: NBSVM [45] is a state-of-the-art performer on many sentiment classification datasets, which trades off between Naive Bayes and an NB-enhanced SVM.
(4) RAE: Recursive Autoencoder [39] has been proven effective in many sentiment analysis tasks by learning compositionality automatically. We run RAE with randomly initialized word embedding.
(5) NRC: NRC built the top-performing system in the SemEval 2013 Twitter sentiment classification track, incorporating diverse sentiment lexicons and many manually designed features. We re-implement this system because its code is not publicly available (for 3-class sentiment classification in SemEval 2013, our re-implementation achieves 68.3%, 0.7% lower than NRC (69%) due to less training data). NRC-ngram refers to the feature set of NRC leaving out ngram features.
Except for DistSuper, the baseline methods are conducted in a supervised manner. We do not compare with RNTN [40] because we cannot efficiently train the RNTN model, as the tweets in our dataset do not have accurately parsed results or fine-grained sentiment labels for phrases. Another reason is that an RNTN model trained on movie reviews cannot be directly applied to tweets due to the differences between the domains [8].
Table 2 shows the macro-F1 of the baseline systems as well as the SSWE-based methods on positive/negative sentiment classification of tweets.
| Method | Macro-F1 |
|---|---|
| DistSuper + unigram | 61.74 |
| DistSuper + uni/bi/tri-gram | 63.84 |
| SVM + unigram | 74.50 |
| SVM + uni/bi/tri-gram | 75.06 |
| NBSVM | 75.28 |
| RAE | 75.12 |
| NRC (Top System in SemEval) | 84.73 |
| NRC - ngram | 84.17 |
| SSWE$_u$ | 84.98 |
| SSWE$_u$ + NRC | 86.58 |
| SSWE$_u$ + NRC-ngram | 86.48 |

Table 2: Macro-F1 of the baseline systems and the SSWE-based methods on positive/negative classification of tweets.
Distant supervision is relatively weak because the noisy-labeled tweets are treated as the gold standard, which hurts the performance of the classifier. The results of the bag-of-ngram (uni/bi/tri-gram) features are not satisfactory because the one-hot word representation cannot capture the latent connections between words. NBSVM and RAE perform comparably and lag well behind the NRC and SSWE-based methods. The reason is that RAE and NBSVM learn the representation of tweets from the small-scale manually annotated training set, which cannot capture the comprehensive linguistic phenomena of words.
NRC implements a variety of features and reaches 84.73% in macro-F1, verifying the importance of a better feature representation for Twitter sentiment classification. We achieve 84.98% by using only SSWE as features, without borrowing any sentiment lexicons or hand-crafted rules. The results indicate that SSWE automatically learns discriminative features from massive tweets and performs comparably with the state-of-the-art manually designed features. After concatenating SSWE with the feature set of NRC, the performance further improves to 86.58%. We also compare SSWE with the ngram features by integrating SSWE into NRC-ngram. The concatenated features SSWE+NRC-ngram (86.48%) outperform the original feature set of NRC (84.73%).
As a reference, we apply SSWE to subjectivity classification of tweets and obtain 72.17% in macro-F1 by using only SSWE as features. After combining SSWE with the feature set of NRC, we improve NRC from 74.86% to 75.39% on subjectivity classification.
We compare sentiment-specific word embeddings (SSWE$_h$, SSWE$_r$, SSWE$_u$) with baseline embedding learning algorithms by using only word embedding as features for Twitter sentiment classification. We use the embeddings of unigrams, bigrams and trigrams in this experiment. The embeddings of C&W [9], word2vec (available at https://code.google.com/p/word2vec/; we utilize the Skip-gram model because it performs better than CBOW in our experiments), WVSA [26] and our models are trained with the same dataset and the same parameter settings. We compare with C&W and word2vec as they have been proven effective in many NLP tasks. The trade-off parameter of ReEmb [23] is tuned on the development set of SemEval 2013.
Table 3 shows the performance on the positive/negative classification of tweets (WVSA and ReEmb are not suitable for learning bigram and trigram embeddings because their sentiment predictor functions only utilize the unigram embedding). ReEmb(C&W) and ReEmb(w2v) stand for the use of embeddings learned from 10 million distant-supervised tweets with C&W and word2vec, respectively. Each row of Table 3 represents a word embedding learning algorithm. Each column stands for a type of embedding used to compose the features of tweets. The column uni+bi denotes the use of unigram and bigram embeddings, and the column uni+bi+tri indicates the use of unigram, bigram and trigram embeddings.
| Embedding | unigram | uni+bi | uni+bi+tri |
|---|---|---|---|
| C&W | 74.89 | 75.24 | 75.89 |
| Word2vec | 73.21 | 75.07 | 76.31 |
| ReEmb(C&W) | 75.87 | – | – |
| ReEmb(w2v) | 75.21 | – | – |
| WVSA | 77.04 | – | – |
| SSWE$_h$ | 81.33 | 83.16 | 83.37 |
| SSWE$_r$ | 80.45 | 81.52 | 82.60 |
| SSWE$_u$ | 83.70 | 84.70 | 84.98 |

Table 3: Macro-F1 on positive/negative classification of tweets with different word embeddings as features.
From the first column of Table 3, we can see that the performance of C&W and word2vec is obviously lower than that of the sentiment-specific word embeddings when only unigram embeddings are used as features. The reason is that C&W and word2vec do not explicitly exploit the sentiment information of the text, so that words with opposite polarity, such as good and bad, are mapped to close word vectors. When such word embeddings are fed as features to a Twitter sentiment classifier, the discriminative ability of the sentiment words is weakened, and the classification performance suffers. The sentiment-specific word embeddings (SSWE$_h$, SSWE$_r$, SSWE$_u$) effectively distinguish words with opposite sentiment polarity and perform best in all three settings. The SSWE models outperform WVSA by exploiting more contextual information in the sentiment predictor function, and outperform ReEmb by leveraging more sentiment information from massive distant-supervised tweets. Among the three sentiment-specific word embeddings, SSWE$_u$ captures the most context information and yields the best performance; SSWE$_h$ and SSWE$_r$ obtain comparable results.
From each row of Table 3, we can see that the bigram and trigram embeddings consistently improve the performance of Twitter sentiment classification. The underlying reason is that a phrase, which cannot be accurately represented by unigram embeddings, is directly encoded into the ngram embedding as an idiomatic unit. A typical case in sentiment analysis is that a composed phrase or multiword expression may have a different sentiment polarity than the individual words it contains, such as not [bad] and [great] deal of (the word in brackets has a different sentiment polarity than the ngram). A very recent study by Mikolov et al. [27] also verified the effectiveness of phrase embeddings for analogical reasoning over phrases.
We tune the hyper-parameter $\alpha$ of SSWE$_u$ on the development set by using unigram embeddings as features. As given in Equation 8, $\alpha$ is the weight of the syntactic loss of SSWE$_u$ and trades off the syntactic and sentiment losses. SSWE$_u$ is trained from 10 million distant-supervised tweets.
Figure 2 shows the macro-F1 of SSWE$_u$ on positive/negative classification of tweets with different $\alpha$ on our development set. We can see that SSWE$_u$ performs better when $\alpha$ is in the range [0.5, 0.6], which balances the syntactic context and the sentiment information. The model with $\alpha=1$ is the C&W model, which only encodes the syntactic contexts of words. The sharp decline at $\alpha=1$ reflects the importance of sentiment information in learning word embedding for Twitter sentiment classification.
We investigate how the size of the distant-supervised data affects the performance of the SSWE$_u$ feature for Twitter sentiment classification. We vary the number of distant-supervised tweets from 1 million to 12 million, in increments of 1 million. We set the $\alpha$ of SSWE$_u$ to 0.5, according to the experiments shown in Figure 2. Results of positive/negative classification of tweets on our development set are given in Figure 3.
We can see that as more distant-supervised tweets are added, the performance of SSWE$_u$ consistently improves. The underlying reason is that with more tweets the word embedding is better estimated, as the vocabulary size is larger and the context and sentiment information are richer. With 10 million distant-supervised tweets, the SSWE$_u$ feature increases the macro-F1 of positive/negative classification of tweets to 82.94% on our development set. Beyond 10 million tweets, the performance remains stable, as the contexts of words have been mostly covered.
The quality of SSWE was implicitly evaluated through its application to Twitter sentiment classification in the previous subsection. We explicitly evaluate it in this section through word similarity in the embedding space for sentiment lexicons. The evaluation metric is the accuracy of polarity consistency between each sentiment word and its top $T$ closest words in the sentiment lexicon,
$$Accuracy = \frac{1}{N \times T} \sum_{i=1}^{N} \sum_{j=1}^{T} \mathbb{1}\big\{pol(w_i) = pol(c_{ij})\big\} \qquad (10)$$

where $N$ is the number of words in the sentiment lexicon, $w_i$ is the $i$-th word in the lexicon, $c_{ij}$ is the $j$-th closest word to $w_i$ in the lexicon under cosine similarity, and $\mathbb{1}\{\cdot\}$ is an indicator function that is equal to 1 if $w_i$ and $c_{ij}$ have the same sentiment polarity and 0 otherwise. Higher accuracy indicates better polarity consistency of words in the sentiment lexicon. We set $T$ to 100 in our experiment.
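A sketch of this metric under the stated definitions; `lexicon` (word → polarity) and `emb` (word → vector) are hypothetical inputs for illustration.

```python
import numpy as np

def polarity_consistency(lexicon, emb, T=100):
    """Accuracy of Equation 10 over a polarity lexicon and an embedding."""
    words = [w for w in lexicon if w in emb]
    M = np.stack([emb[w] for w in words])
    M /= np.linalg.norm(M, axis=1, keepdims=True)   # unit-normalize rows
    sims = M @ M.T                                  # cosine similarities
    np.fill_diagonal(sims, -np.inf)                 # exclude the word itself
    hits = 0
    for i, w in enumerate(words):
        top = np.argsort(-sims[i])[:T]              # T closest lexicon words
        hits += sum(lexicon[words[j]] == lexicon[w] for j in top)
    return hits / (len(words) * T)
```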
We utilize the widely-used sentiment lexicons, namely MPQA [46] and HL [19], to evaluate the quality of word embedding. For each lexicon, we remove the words that do not appear in the lookup table of word embedding. We only use unigram embedding in this section because these sentiment lexicons do not contain phrases. The distribution of the lexicons used in this paper is listed in Table 4.
| Lexicon | Positive | Negative | Total |
|---|---|---|---|
| HL | 1,331 | 2,647 | 3,978 |
| MPQA | 1,932 | 2,817 | 4,749 |
| Joint | 1,051 | 2,024 | 3,075 |

Table 4: Distribution of the sentiment lexicons used in this paper.
Table 5 shows our results compared with other word embedding learning algorithms. The accuracy of the random baseline is 50%, as positive and negative words occur randomly among the nearest neighbors of each word. The sentiment-specific word embeddings (SSWE$_h$, SSWE$_r$, SSWE$_u$) outperform existing neural models (C&W, word2vec) by large margins.
| Embedding | HL | MPQA | Joint |
|---|---|---|---|
| Random | 50.00 | 50.00 | 50.00 |
| C&W | 63.10 | 58.13 | 62.58 |
| Word2vec | 66.22 | 60.72 | 65.59 |
| ReEmb(C&W) | 64.81 | 59.76 | 64.09 |
| ReEmb(w2v) | 67.16 | 61.81 | 66.39 |
| WVSA | 68.14 | 64.07 | 67.12 |
| SSWE$_h$ | 74.17 | 68.36 | 74.03 |
| SSWE$_r$ | 73.65 | 68.02 | 73.14 |
| SSWE$_u$ | 77.30 | 71.74 | 77.33 |

Table 5: Accuracy of the polarity consistency of words in the sentiment lexicons.
SSWE$_u$ performs best on all three lexicons. SSWE$_h$ and SSWE$_r$ have comparable performance. The experimental results further demonstrate that the sentiment-specific word embeddings are able to capture the sentiment information of texts and distinguish words with opposite sentiment polarity, which is not well handled by traditional neural models like C&W and word2vec. SSWE outperforms WVSA and ReEmb by exploiting more context information of words and more sentiment information of sentences, respectively.
In this paper, we propose learning continuous word representations as features for Twitter sentiment classification under a supervised learning framework. We show that the word embeddings learned by traditional neural networks are not effective enough for Twitter sentiment classification: these methods typically model only the context information of words, so they cannot distinguish words with similar contexts but opposite sentiment polarity (e.g. good and bad). We learn sentiment-specific word embedding (SSWE) by integrating sentiment information into the loss functions of three neural networks. We train SSWE with massive distant-supervised tweets selected by positive and negative emoticons. The effectiveness of SSWE has been implicitly evaluated by using it as features for sentiment classification on the benchmark dataset of SemEval 2013, and explicitly verified by measuring word similarity in the embedding space for sentiment lexicons. Our unified model, which combines the syntactic context of words and the sentiment information of sentences, yields the best performance in both experiments.
We thank Yajuan Duan, Shujie Liu, Zhenghua Li, Li Dong, Hong Sun and Lanjun Zhou for their great help. This research was partly supported by National Natural Science Foundation of China (No.61133012, No.61273321, No.61300113).