Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification

(This work was done when the first and third authors were visiting Microsoft Research Asia.)

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, Bing Qin
Research Center for Social Computing and Information Retrieval
Harbin Institute of Technology, China
Microsoft Research, Beijing, China
University of Science and Technology of China, Hefei, China
{dytang, tliu, qinb}@ir.hit.edu.cn
{fuwei, v-nayang, mingzhou}@microsoft.com
Abstract

In this paper, we present a method that learns word embedding for Twitter sentiment classification. Most existing algorithms for learning continuous word representations typically model only the syntactic context of words and ignore the sentiment of text. This is problematic for sentiment analysis because they usually map words with similar syntactic context but opposite sentiment polarity, such as good and bad, to neighboring word vectors. We address this issue by learning sentiment-specific word embedding (SSWE), which encodes sentiment information in the continuous representation of words. Specifically, we develop three neural networks that incorporate the supervision from the sentiment polarity of text (e.g., sentences or tweets) in their loss functions. To obtain large-scale training corpora, we learn the sentiment-specific word embedding from massive distant-supervised tweets collected with positive and negative emoticons. Experiments on applying SSWE to a benchmark Twitter sentiment classification dataset from SemEval 2013 show that (1) the SSWE feature performs comparably with the hand-crafted features of the top-performing system, and (2) the performance is further improved by concatenating SSWE with the existing feature set.

1 Introduction

Twitter sentiment classification has attracted increasing research interest in recent years [21, 20]. The objective is to classify the sentiment polarity of a tweet as positive, negative or neutral. The majority of existing approaches follow Pang et al. [33] and employ machine learning algorithms to build classifiers from tweets with manually annotated sentiment polarity. Under this direction, most studies focus on designing effective features to obtain better classification performance. For example, Mohammad et al. (2013) built the top-performing system in the Twitter sentiment classification track of SemEval 2013 [31] using diverse sentiment lexicons and a variety of hand-crafted features.

Feature engineering is important but labor-intensive. It is therefore desirable to discover explanatory factors from the data and make learning algorithms less dependent on extensive feature engineering [4]. For sentiment classification, an effective feature learning method is to compose the representation of a sentence (or document) from the representations of the words or phrases it contains [40, 47]. A crucial step, accordingly, is to learn the word representation (or word embedding): a dense, low-dimensional, real-valued vector for each word. Although existing word embedding learning algorithms [9, 27] are intuitive choices, they are not effective enough when used directly for sentiment classification. The most serious problem is that traditional methods typically model the syntactic context of words but ignore the sentiment information of text. As a result, words with opposite polarity, such as good and bad, are mapped to close vectors. This is sensible for tasks such as POS tagging [49], since the two words have similar usage and grammatical roles, but it is disastrous for sentiment analysis, where the two words have opposite sentiment polarity.

In this paper, we propose learning sentiment-specific word embedding (SSWE) for sentiment analysis. We encode sentiment information into the continuous representation of words, so that the embedding separates good and bad to opposite ends of the spectrum. To this end, we extend an existing word embedding learning algorithm [9] and develop three neural networks that incorporate the supervision from the sentiment polarity of text (e.g., sentences or tweets) in their loss functions. We learn the sentiment-specific word embedding from tweets, leveraging massive tweets with emoticons as a distant-supervised corpus without any manual annotation. These automatically collected tweets contain noise, so they cannot be used directly as gold training data for building sentiment classifiers, but they are effective enough to provide weakly supervised signals for training the sentiment-specific word embedding.

We apply SSWE as features in a supervised learning framework for Twitter sentiment classification and evaluate it on the benchmark dataset of SemEval 2013. On predicting the positive/negative polarity of tweets, our method yields 84.98% in macro-F1 using only SSWE as features, comparable to the top-performing system based on hand-crafted features (84.73%). After concatenating the SSWE feature with the existing feature set, we push the state of the art to 86.58% in macro-F1. We also evaluate the quality of SSWE directly by measuring word similarity in the embedding space over sentiment lexicons: on the accuracy of polarity consistency between each sentiment word and its top N closest words, SSWE outperforms existing word embedding learning algorithms.

The major contributions of the work presented in this paper are as follows.

  • We develop three neural networks to learn sentiment-specific word embedding (SSWE) from massive distant-supervised tweets without any manual annotations;

  • To our knowledge, this is the first work that exploits word embedding for Twitter sentiment classification. We report results showing that the SSWE feature performs comparably with the hand-crafted features of the top-performing system in SemEval 2013;

  • We release the sentiment-specific word embedding learned from 10 million tweets, which can be used off-the-shelf in other sentiment analysis tasks.

2 Related Work

In this section, we present a brief review of the related work from two perspectives, Twitter sentiment classification and learning continuous representations for sentiment classification.

2.1 Twitter Sentiment Classification

Twitter sentiment classification, which identifies the sentiment polarity of short, informal tweets, has attracted increasing research interest [21, 20] in recent years. Generally, the methods employed in Twitter sentiment classification follow traditional sentiment classification approaches. The lexicon-based approaches [44, 11, 41, 42] mostly use a dictionary of sentiment words with their associated sentiment polarity, and incorporate negation and intensification to compute the sentiment polarity for each sentence (or document).

The learning-based methods for Twitter sentiment classification follow Pang et al. (2002), treating sentiment classification of texts as a special case of text categorization. Many studies on Twitter sentiment classification [32, 10, 1, 22, 48] leverage massive noisy-labeled tweets selected by positive and negative emoticons as the training set and build sentiment classifiers directly, which is called distant supervision [17]. Instead of using the distant-supervised data directly as a training set, Liu et al. [25] adopt tweets with emoticons to smooth the language model, and Hu et al. [20] incorporate emotional signals into an unsupervised learning framework for Twitter sentiment classification.

Many existing learning-based methods for Twitter sentiment classification focus on feature engineering, because the performance of a sentiment classifier depends heavily on the choice of feature representation for tweets. The most representative system is that of Mohammad et al. [30], the state-of-the-art system (the top-performing system in the SemEval 2013 Twitter sentiment classification track), which implements a large number of hand-crafted features. Unlike these previous studies, we focus on learning discriminative features automatically from massive distant-supervised tweets.

2.2 Learning Continuous Representations for Sentiment Classification

Pang et al. (2002) pioneered this field with the bag-of-words representation, in which each word is a one-hot vector: its length equals the size of the vocabulary, and only one dimension is 1, with all others 0. Under this representation, many feature learning algorithms have been proposed to obtain better classification performance [34, 24, 14]. However, the one-hot word representation cannot sufficiently capture the complex linguistic characteristics of words.

With the revival of interest in deep learning [2], using the continuous representation of a word as features has proven effective in a variety of NLP tasks, such as parsing [35], language modeling [3, 29] and NER [43]. In the field of sentiment analysis, Bespalov et al. [5, 6] initialize the word embedding with Latent Semantic Analysis and represent each document as a weighted linear combination of ngram vectors for sentiment classification. Yessenalina and Cardie [47] model each word as a matrix and combine words through iterated matrix multiplication. Glorot et al. [16] explore Stacked Denoising Autoencoders for domain adaptation in sentiment classification. Socher et al. propose the Recursive Neural Network (RNN) [38], the matrix-vector RNN [37] and the Recursive Neural Tensor Network (RNTN) [40] to learn the compositionality of phrases of arbitrary length, built recursively from the representations of each pair of children. Hermann et al. [18] present Combinatory Categorial Autoencoders to learn sentence compositionality, marrying Combinatory Categorial Grammar with the Recursive Autoencoder.

The representation of words heavily depends on the application or task in which it is used [23]. This paper focuses on learning sentiment-specific word embedding, which is tailored for sentiment analysis. Unlike Maas et al. (2011), who follow the probabilistic document model [7] and attach a sentiment predictor function to each word, we develop neural networks that map each ngram to the sentiment polarity of its sentence. Unlike Socher et al. (2011c), who utilize manually labeled texts to learn the meaning of phrases (or sentences) through compositionality, we focus on learning the meaning of words, namely word embedding, from massive distant-supervised tweets. Unlike Labutov and Lipson (2013), who produce a task-specific embedding from an existing word embedding, we learn sentiment-specific word embedding from scratch.

3 Sentiment-Specific Word Embedding for Twitter Sentiment Classification

In this section, we present the details of learning sentiment-specific word embedding (SSWE) for Twitter sentiment classification. We propose incorporating the sentiment information of sentences to learn continuous representations for words and phrases. We extend the existing word embedding learning algorithm [9] and develop three neural networks to learn SSWE. In the following sections, we introduce the traditional method before presenting the details of SSWE learning algorithms. We then describe the use of SSWE in a supervised learning framework for Twitter sentiment classification.

3.1 C&W Model

Collobert et al. (2011) introduce the C&W model, which learns word embedding based on the syntactic contexts of words. Given an ngram “cat chills on a mat”, C&W replaces the center word with a random word wr, deriving a corrupted ngram “cat chills wr a mat”. The training objective is that the original ngram should obtain a higher language model score than the corrupted ngram by a margin of 1. This ranking objective can be optimized with a hinge loss,

$loss_{cw}(t, t^r) = \max(0,\ 1 - f^{cw}(t) + f^{cw}(t^r))$   (1)

where $t$ is the original ngram, $t^r$ is the corrupted ngram, and $f^{cw}(\cdot)$ outputs a one-dimensional scalar representing the language model score of the input ngram.

Figure 1: The traditional C&W model and our neural networks (SSWEh and SSWEu) for learning sentiment-specific word embedding.

Figure 1(a) illustrates the neural architecture of C&W, which consists of four layers: lookup → linear → hTanh → linear (from bottom to top). The original and corrupted ngrams are each fed into the feed-forward neural network as input. The output $f^{cw}$ is the language model score of the input, calculated as in Equation 2, where $L$ is the word embedding lookup table and $w_1, w_2, b_1, b_2$ are the parameters of the linear layers.

$f^{cw}(t) = w_2 a + b_2$   (2)

$a = hTanh(w_1 L_t + b_1)$   (3)

$hTanh(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases}$   (4)
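To make Equations 1–4 concrete, the following is a minimal numpy sketch of the C&W scorer and its hinge loss. The window size (3), embedding length (50) and hidden-layer size (20) follow the settings reported in Section 3.2; the variable names, initialization scale and toy word ids are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, window, hidden_dim = 10000, 50, 3, 20

L  = rng.normal(scale=0.01, size=(vocab_size, embed_dim))       # lookup table
w1 = rng.normal(scale=0.01, size=(hidden_dim, window * embed_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.01, size=(1, hidden_dim))
b2 = np.zeros(1)

def htanh(x):
    return np.clip(x, -1.0, 1.0)          # Equation 4

def f_cw(word_ids):
    """Language model score of an ngram given as word indices."""
    t = L[word_ids].reshape(-1)           # concatenated embeddings (L_t)
    a = htanh(w1 @ t + b1)                # Equation 3
    return (w2 @ a + b2)[0]               # Equation 2

def loss_cw(ngram, corrupted):
    # Equation 1: the original ngram should outscore the corrupted one by 1.
    return max(0.0, 1.0 - f_cw(ngram) + f_cw(corrupted))

original  = [42, 7, 99]                               # toy ids for "cat chills on"
corrupted = [42, int(rng.integers(vocab_size)), 99]   # center word replaced
print(loss_cw(original, corrupted))
```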

3.2 Sentiment-Specific Word Embedding

Following the traditional C&W model [9], we incorporate the sentiment information into the neural network to learn sentiment-specific word embedding. We develop three neural networks with different strategies to integrate the sentiment information of tweets.

Basic Model 1 (SSWEh).

As an unsupervised approach, the C&W model does not explicitly capture the sentiment information of texts. An intuitive way to integrate sentiment information is to predict the sentiment distribution of text based on an input ngram. We do not use the entire sentence as input because sentence lengths vary. Instead, we slide a window of ngrams across a sentence and predict the sentiment polarity from each ngram with a shared neural network. In a neural network, the distributed representations of the higher layers are interpreted as features describing the input. We therefore use the continuous vector of the top layer to predict the sentiment distribution of the text.

Assuming there are K labels, we change the dimension of the top layer in the C&W model to K and add a softmax layer on top of it. The neural network (SSWEh) is shown in Figure 1(b). A softmax layer is suitable for this scenario because its outputs can be interpreted as conditional probabilities. Unlike C&W, SSWEh does not generate any corrupted ngram. Let $f^g(t)$ be the gold K-dimensional multinomial distribution of input $t$, where K denotes the number of sentiment polarity labels and $\sum_k f_k^g(t) = 1$. For positive/negative classification, the distribution is $[1, 0]$ for positive and $[0, 1]$ for negative. The cross-entropy error of the softmax layer is:

$loss_h(t) = -\sum_{k \in \{0,1\}} f_k^g(t) \log f_k^h(t)$   (5)

where $f^g(t)$ is the gold sentiment distribution and $f^h(t)$ is the predicted sentiment distribution.
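Continuing the numpy sketch above (reusing its `L`, `w1`, `b1`, `htanh`, `rng` and `hidden_dim`), SSWEh amounts to widening the top layer to K = 2 and adding a softmax; a hedged rendering of the forward pass and the cross-entropy of Equation 5:

```python
# Continues the C&W sketch above (L, w1, b1, htanh, rng defined there).
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

w2_h = rng.normal(scale=0.01, size=(2, hidden_dim))   # top layer is now K=2
b2_h = np.zeros(2)

def f_h(word_ids):
    t = L[word_ids].reshape(-1)
    a = htanh(w1 @ t + b1)
    return softmax(w2_h @ a + b2_h)       # predicted sentiment distribution

def loss_h(word_ids, gold):               # gold is [1, 0] or [0, 1]
    pred = f_h(word_ids)
    return -sum(g * np.log(p) for g, p in zip(gold, pred))   # Equation 5

print(loss_h([42, 7, 99], [1, 0]))        # an ngram from a positive tweet
```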

Basic Model 2 (SSWEr).

SSWEh is trained to predict $[1, 0]$ for positive ngrams and $[0, 1]$ for negative ngrams. However, this constraint is too strict. A distribution of $[0.7, 0.3]$ can also indicate a positive label because the positive score is larger than the negative score; similarly, $[0.2, 0.8]$ indicates negative polarity. This observation suggests relaxing the hard constraints of SSWEh: if the sentiment polarity of a tweet is positive, the predicted positive score should be larger than the predicted negative score, and vice versa for negative polarity.

We model this relaxed constraint with a ranking objective and borrow the bottom four layers of SSWEh, namely lookup → linear → hTanh → linear in Figure 1(b), to build the relaxed neural network (SSWEr). Compared with SSWEh, the softmax layer is removed because SSWEr does not require a probabilistic interpretation. The hinge loss of SSWEr is

$loss_r(t) = \max(0,\ 1 - \delta_s(t) f_0^r(t) + \delta_s(t) f_1^r(t))$   (6)

where $f_0^r$ is the predicted positive score, $f_1^r$ is the predicted negative score, and $\delta_s(t)$ is an indicator function reflecting the sentiment polarity of the sentence,

$\delta_s(t) = \begin{cases} 1 & \text{if } f^g(t) = [1, 0] \\ -1 & \text{if } f^g(t) = [0, 1] \end{cases}$   (7)

Like SSWEh, SSWEr does not generate corrupted ngrams.
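A corresponding sketch of the SSWEr ranking loss (Equations 6–7), again continuing the sketches above; sharing the 2-dimensional top-layer parameters with the SSWEh sketch is our simplification for brevity:

```python
# Continues the sketches above; reuses w2_h/b2_h as a 2-d top layer.
def f_r(word_ids):
    t = L[word_ids].reshape(-1)
    a = htanh(w1 @ t + b1)
    return w2_h @ a + b2_h                # two unnormalized scores, no softmax

def loss_r(word_ids, gold):
    delta = 1.0 if gold == [1, 0] else -1.0               # Equation 7
    s = f_r(word_ids)                     # s[0]: positive, s[1]: negative
    return max(0.0, 1.0 - delta * s[0] + delta * s[1])    # Equation 6
```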

Unified Model (SSWEu).

The C&W model learns word embedding by modeling the syntactic contexts of words while ignoring sentiment information. By contrast, SSWEh and SSWEr learn sentiment-specific word embedding by integrating the sentiment polarity of sentences but leaving out the syntactic contexts of words. We therefore develop a unified model (SSWEu) that captures both the sentiment information of sentences and the syntactic contexts of words. SSWEu is illustrated in Figure 1(c).

Given an original (or corrupted) ngram and the sentiment polarity of its sentence as input, SSWEu predicts a two-dimensional vector for each input ngram. The two scalars ($f_0^u$, $f_1^u$) stand for the language model score and the sentiment score of the input ngram, respectively. The training objectives of SSWEu are that (1) the original ngram should obtain a higher language model score $f_0^u(t)$ than the corrupted ngram $f_0^u(t^r)$, and (2) the sentiment score of the original ngram $f_1^u(t)$ should be more consistent with the gold polarity annotation of the sentence than that of the corrupted ngram $f_1^u(t^r)$. The loss function of SSWEu is a linear combination of two hinge losses,

$loss_u(t, t^r) = \alpha \cdot loss_{cw}(t, t^r) + (1 - \alpha) \cdot loss_{us}(t, t^r)$   (8)

where $loss_{cw}(t, t^r)$ is the syntactic loss given in Equation 1, $loss_{us}(t, t^r)$ is the sentiment loss described in Equation 9, and the hyper-parameter $\alpha$ weighs the two parts.

$loss_{us}(t, t^r) = \max(0,\ 1 - \delta_s(t) f_1^u(t) + \delta_s(t) f_1^u(t^r))$   (9)
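Putting Equations 8–9 together, a hedged sketch of the unified loss, continuing the earlier sketches and reading the network's two outputs as the language model score and the sentiment score:

```python
# Continues the sketches above; reuses w2_h/b2_h as the 2-d top layer.
def f_u(word_ids):
    t = L[word_ids].reshape(-1)
    a = htanh(w1 @ t + b1)
    return w2_h @ a + b2_h        # [f_u0: language model score, f_u1: sentiment score]

def loss_u(ngram, corrupted, gold, alpha=0.5):
    delta = 1.0 if gold == [1, 0] else -1.0
    lm_loss = max(0.0, 1.0 - f_u(ngram)[0] + f_u(corrupted)[0])      # as in Eq. 1
    senti_loss = max(0.0, 1.0 - delta * f_u(ngram)[1]
                               + delta * f_u(corrupted)[1])          # Equation 9
    return alpha * lm_loss + (1 - alpha) * senti_loss                # Equation 8
```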

Model Training.

We train sentiment-specific word embedding from massive distant-supervised tweets collected with positive and negative emoticons (we use the emoticons selected by Hu et al. [20]: the positive emoticons are :) : ) :-) :D =) and the negative emoticons are :( : ( :-( ). We crawled tweets from April 1st, 2013 to April 30th, 2013 with the Twitter API. We tokenize each tweet with TwitterNLP [15], remove @user mentions and URLs, and filter out tweets that are too short (< 7 words). In total we collect 10M tweets: 5M with positive emoticons and 5M with negative emoticons.
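The collection procedure might be sketched as follows. The emoticon lists and the < 7-word filter come from the text above; the regular expressions for @user and URLs, and the choice to strip the emoticons themselves from the text, are our own assumptions:

```python
import re

POS = [":)", ": )", ":-)", ":D", "=)"]   # positive emoticons (Hu et al. [20])
NEG = [":(", ": (", ":-("]               # negative emoticons

def distant_label(tweet):
    """Return ('positive'|'negative', cleaned text) or None if unusable."""
    has_pos = any(e in tweet for e in POS)
    has_neg = any(e in tweet for e in NEG)
    if has_pos == has_neg:                           # no signal, or conflicting
        return None
    text = re.sub(r"@\w+|https?://\S+", "", tweet)   # strip @user and URLs
    for e in POS + NEG:                              # strip the emoticons (our choice)
        text = text.replace(e, "")
    if len(text.split()) < 7:                        # too short
        return None
    return ("positive" if has_pos else "negative", text.strip())

print(distant_label("I love this :) http://t.co/x great day indeed my friend"))
```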

We train SSWEh, SSWEr and SSWEu by taking the derivative of the loss with respect to the whole set of parameters through back-propagation [9], and use AdaGrad [12] to update the parameters. We empirically set the window size to 3, the embedding length to 50, the length of the hidden layer to 20 and the AdaGrad learning rate to 0.1 for all baselines and our models. We learn embeddings for unigrams, bigrams and trigrams separately with the same neural network and the same parameter settings; the contexts of a unigram (bigram/trigram) are the surrounding unigrams (bigrams/trigrams), respectively.
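For reference, a generic AdaGrad step (Duchi et al. [12]) with the paper's learning rate of 0.1; the epsilon term is a standard numerical-stability addition of ours, not mentioned in the paper:

```python
import numpy as np

def adagrad_step(param, grad, cache, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-parameter learning rates derived from the
    accumulated squared gradients (lr=0.1 matches the paper's setting)."""
    cache += grad ** 2                            # accumulate squared gradients
    param -= lr * grad / (np.sqrt(cache) + eps)   # scaled gradient step
    return param, cache
```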

3.3 Twitter Sentiment Classification

We apply sentiment-specific word embedding to Twitter sentiment classification under a supervised learning framework, as in previous work [33]. Instead of hand-crafting features, we use the continuous representations of words and phrases as the features of a tweet. The sentiment classifier is built from tweets with manually annotated sentiment polarity.

We explore min, average and max convolutional layers [9, 36], which have been used as simple and effective methods for compositionality learning in vector-based semantics [28], to obtain the tweet representation. The result is the concatenation of vectors derived from different convolutional layers.

$z(tw) = [z_{max}(tw),\ z_{min}(tw),\ z_{average}(tw)]$

where $z(tw)$ is the representation of tweet $tw$ and $z_x(tw)$ is the result of the convolutional layer $x \in \{min, max, average\}$. Each convolutional layer $z_x$ uses the embeddings of unigrams, bigrams and trigrams separately and applies the operation $x$ to the sequence represented by the columns of each lookup table. The output of $z_x$ is the concatenation of the results obtained from the different lookup tables.

$z_x(tw) = [w_x \cdot L_{uni}^{tw},\ w_x \cdot L_{bi}^{tw},\ w_x \cdot L_{tri}^{tw}]$

where $w_x$ is the convolutional function of $z_x$ and $L^{tw}$ denotes the concatenated column vectors of the words in the tweet. $L_{uni}$, $L_{bi}$ and $L_{tri}$ are the lookup tables of the unigram, bigram and trigram embeddings, respectively.
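A self-contained sketch of this pooling scheme, with randomly initialized stand-in lookup tables and toy id sequences (the real model uses the learned SSWE tables):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in lookup tables for unigram, bigram and trigram embeddings.
L_uni, L_bi, L_tri = (rng.normal(size=(1000, 50)) for _ in range(3))

def pool_features(id_seqs, tables):
    """id_seqs/tables: aligned lists of (unigram, bigram, trigram) ids/tables."""
    parts = []
    for op in (np.max, np.min, np.mean):          # z_max, z_min, z_average
        for ids, table in zip(id_seqs, tables):
            vecs = table[ids]                     # (sequence length, embed_dim)
            parts.append(op(vecs, axis=0))        # pool over the sequence
    return np.concatenate(parts)                  # z(tw)

z = pool_features([[1, 5, 9], [2, 6], [3]], [L_uni, L_bi, L_tri])
print(z.shape)   # (450,) = 3 pooling ops x 3 lookup tables x 50 dims
```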

4 Experiment

We conduct experiments to evaluate SSWE by incorporating it into a supervised learning framework for Twitter sentiment classification. We also directly evaluate the effectiveness of SSWE by measuring word similarity in the embedding space over sentiment lexicons.

4.1 Twitter Sentiment Classification

Experiment Setup and Datasets.

We conduct experiments on the latest Twitter sentiment classification benchmark dataset from SemEval 2013 [31]. The training and development sets were released in full to task participants; however, we were unable to download all of them because some tweets had been deleted or made unavailable due to changed authorization status. The test set was provided directly to the participants. The distribution of our dataset is given in Table 1. We train the sentiment classifier with LibLinear [13] on the training set, tune the parameter -c on the dev set and evaluate on the test set. The evaluation metric is the macro-F1 of the positive and negative categories (we investigate 2-class Twitter sentiment classification, positive/negative, rather than the 3-class positive/negative/neutral task of SemEval 2013).

Split   Positive  Negative  Neutral  Total
Train   2,642     994       3,436    7,072
Dev     408       219       493      1,120
Test    1,570     601       1,639    3,810
Table 1: Statistics of the SemEval 2013 Twitter sentiment classification dataset.
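For clarity, the metric might be computed as in the following sketch: macro-F1 averaged over the positive and negative classes only, with neutral excluded from the average (the implementation and label strings are ours, for illustration):

```python
def f1(gold, pred, cls):
    """F1 of one class from gold/predicted label lists."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1_pos_neg(gold, pred):
    return (f1(gold, pred, "positive") + f1(gold, pred, "negative")) / 2

print(macro_f1_pos_neg(["positive", "negative", "neutral"],
                       ["positive", "negative", "negative"]))
```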

Baseline Methods.

We compare our method with the following sentiment classification algorithms:

(1) DistSuper: We use the 10 million tweets selected by positive and negative emoticons as training data and build a sentiment classifier with LibLinear and ngram features [17].

(2) SVM: The ngram features and Support Vector Machine are widely used baseline methods to build sentiment classifiers [33]. LibLinear is used to train the SVM classifier.

(3) NBSVM: NBSVM [45] is a state-of-the-art performer on many sentiment classification datasets; it trades off between Naive Bayes and an NB-feature-enhanced SVM.

(4) RAE: The Recursive Autoencoder [39] has proven effective in many sentiment analysis tasks by learning compositionality automatically. We run RAE with randomly initialized word embedding.

(5) NRC: NRC built the top-performing system in the SemEval 2013 Twitter sentiment classification track, incorporating diverse sentiment lexicons and many manually designed features. We re-implement this system because its code is not publicly available (for 3-class sentiment classification in SemEval 2013, our re-implementation achieves 68.3%, 0.7% lower than NRC's 69%, due to having less training data). NRC-ngram refers to the NRC feature set with the ngram features left out.

Except for DistSuper, the baseline methods are all supervised. We do not compare with RNTN [40] because we cannot train it efficiently: the tweets in our dataset lack accurate parses and fine-grained sentiment labels for phrases. Moreover, an RNTN model trained on movie reviews cannot be applied directly to tweets because of the differences between the domains [8].

Results and Analysis.

Table 2 shows the macro-F1 of the baseline systems as well as the SSWE-based methods on positive/negative sentiment classification of tweets.

Method Macro-F1
DistSuper + unigram 61.74
DistSuper + uni/bi/tri-gram 63.84
SVM + unigram 74.50
SVM + uni/bi/tri-gram 75.06
NBSVM 75.28
RAE 75.12
NRC (Top System in SemEval) 84.73
NRC - ngram 84.17
SSWEu 84.98
SSWEu+NRC 86.58
SSWEu+NRC-ngram 86.48
Table 2: Macro-F1 on positive/negative classification of tweets.

Distant supervision is relatively weak because the noisy-labeled tweets are treated as the gold standard, which hurts the performance of the classifier. The results of the bag-of-ngram (uni/bi/tri-gram) features are not satisfactory because the one-hot word representation cannot capture the latent connections between words. NBSVM and RAE perform comparably with each other but lag well behind the NRC and SSWE-based methods: both learn the representation of tweets from the small manually annotated training set, which cannot adequately capture the comprehensive linguistic phenomena of words.

NRC implements a variety of features and reaches 84.73% in macro-F1, verifying the importance of a better feature representation for Twitter sentiment classification. We achieve 84.98% using only SSWEu as features, without borrowing any sentiment lexicons or hand-crafted rules. This indicates that SSWEu automatically learns discriminative features from massive tweets and performs comparably with the state-of-the-art manually designed features. After concatenating SSWEu with the NRC feature set, the performance further improves to 86.58%. We also compare SSWEu with the ngram features by integrating SSWEu into NRC-ngram; the concatenated features SSWEu+NRC-ngram (86.48%) outperform the original NRC feature set (84.73%).

As a reference, we apply SSWEu to the subjectivity classification of tweets and obtain 72.17% in macro-F1 using only SSWEu as features. Combining SSWEu with the NRC feature set improves NRC from 74.86% to 75.39% on subjectivity classification.

Comparison between Different Word Embeddings.

We compare sentiment-specific word embedding (SSWEh, SSWEr, SSWEu) with baseline embedding learning algorithms by using only word embeddings as features for Twitter sentiment classification. We use the embeddings of unigrams, bigrams and trigrams in this experiment. The embeddings of C&W [9], word2vec (available at https://code.google.com/p/word2vec/; we use the Skip-gram model because it performs better than CBOW in our experiments), WVSA [26] and our models are trained on the same dataset with the same parameter settings. We compare with C&W and word2vec because they have proven effective in many NLP tasks. The trade-off parameter of ReEmb [23] is tuned on the development set of SemEval 2013.

Table 3 shows the performance on positive/negative classification of tweets (WVSA and ReEmb are not suited to learning bigram and trigram embeddings because their sentiment predictor functions only utilize unigram embeddings). ReEmb(C&W) and ReEmb(w2v) stand for embeddings learned from the 10 million distant-supervised tweets with C&W and word2vec, respectively. Each row of Table 3 represents a word embedding learning algorithm, and each column stands for the type of embedding used to compose the features of tweets. The column uni+bi denotes the use of unigram and bigram embeddings, and uni+bi+tri the use of unigram, bigram and trigram embeddings.

Embedding     unigram  uni+bi  uni+bi+tri
C&W           74.89    75.24   75.89
Word2vec      73.21    75.07   76.31
ReEmb(C&W)    75.87    –       –
ReEmb(w2v)    75.21    –       –
WVSA          77.04    –       –
SSWEh         81.33    83.16   83.37
SSWEr         80.45    81.52   82.60
SSWEu         83.70    84.70   84.98
Table 3: Macro-F1 on positive/negative classification of tweets with different word embeddings.

From the first column of Table 3, we can see that, using only unigram embeddings as features, C&W and word2vec perform clearly worse than the sentiment-specific word embeddings. The reason is that C&W and word2vec do not explicitly exploit the sentiment information of the text, so words with opposite polarity, such as good and bad, are mapped to close word vectors. When such embeddings are fed as features to a Twitter sentiment classifier, the discriminative ability of sentiment words is weakened, which hurts classification performance. The sentiment-specific word embeddings (SSWEh, SSWEr, SSWEu) effectively distinguish words with opposite sentiment polarity and perform best in all three settings. SSWE outperforms WVSA by exploiting more contextual information in the sentiment predictor function, and outperforms ReEmb by leveraging more sentiment information from the massive distant-supervised tweets. Among the three sentiment-specific word embeddings, SSWEu captures more context information and yields the best performance, while SSWEh and SSWEr obtain comparable results.

From each row of Table 3, we can see that the bigram and trigram embeddings consistently improve the performance of Twitter sentiment classification. The underlying reason is that a phrase that cannot be accurately represented by unigram embeddings is encoded directly in the ngram embedding as an idiomatic unit. A typical case in sentiment analysis is that a composed phrase or multiword expression may have a different sentiment polarity than the individual words it contains, such as not [bad] and [great] deal of (the bracketed word differs in sentiment polarity from the ngram). A recent study by Mikolov et al. [27] also verified the effectiveness of phrase embeddings for analogical reasoning over phrases.

Effect of α in SSWEu.

We tune the hyper-parameter α of SSWEu on the development set using unigram embeddings as features. As given in Equation 8, α weights the syntactic loss of SSWEu, trading off the syntactic and sentiment losses. SSWEu is trained on 10 million distant-supervised tweets.

Figure 2: Macro-F1 of SSWEu on the development set of SemEval 2013 with different α.

Figure 2 shows the macro-F1 of SSWEu on positive/negative classification of tweets with different α on our development set. SSWEu performs best when α is in the range [0.5, 0.6], which balances the syntactic context and the sentiment information. The model with α = 1 is the C&W model, which encodes only the syntactic contexts of words. The sharp decline at α = 1 reflects the importance of sentiment information for learning word embedding for Twitter sentiment classification.

Effect of Distant-supervised Data in SSWEu.

We investigate how the size of the distant-supervised data affects the performance of the SSWEu feature for Twitter sentiment classification. We vary the number of distant-supervised tweets from 1 million to 12 million in increments of 1 million, and set α of SSWEu to 0.5 according to the experiments shown in Figure 2. Results of positive/negative classification of tweets on our development set are given in Figure 3.

Figure 3: Macro-F1 of SSWEu with different size of distant-supervised data on our development set.

We can see that the performance of SSWEu consistently improves as more distant-supervised tweets are added. The underlying reason is that with more tweets, the word embedding is better estimated: the vocabulary is larger and the context and sentiment information are richer. With 10 million distant-supervised tweets, the SSWEu feature reaches 82.94% macro-F1 on positive/negative classification on our development set. Beyond 10 million tweets the performance remains stable, as the contexts of words are by then mostly covered.

4.2 Word Similarity of Sentiment Lexicons

The quality of SSWE was implicitly evaluated through Twitter sentiment classification in the previous subsection. In this section we evaluate it explicitly through word similarity in the embedding space over sentiment lexicons. The evaluation metric is the accuracy of polarity consistency between each sentiment word and its top N closest words in the sentiment lexicon,

$Accuracy = \dfrac{\sum_{i=1}^{\#Lex} \sum_{j=1}^{N} \beta(w_i, c_{ij})}{\#Lex \times N}$   (10)

where $\#Lex$ is the number of words in the sentiment lexicon, $w_i$ is the $i$-th word in the lexicon, $c_{ij}$ is the $j$-th closest word to $w_i$ in the lexicon under cosine similarity, and $\beta(w_i, c_{ij})$ is an indicator function equal to 1 if $w_i$ and $c_{ij}$ have the same sentiment polarity and 0 otherwise. Higher accuracy indicates better polarity consistency of words in the sentiment lexicon. We set N to 100 in our experiments.
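Equation 10 might be computed as in the following sketch, where the embedding and polarity dictionaries are illustrative toy inputs of our own:

```python
import numpy as np

def polarity_consistency(emb, polarity, N=100):
    """emb: {word: vector}; polarity: {word: +1/-1}; words restricted to the lexicon."""
    words = list(emb)
    M = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    sims = M @ M.T                                  # cosine similarity matrix
    hits = total = 0
    for i, w in enumerate(words):
        order = np.argsort(-sims[i])                # neighbors by descending similarity
        neighbors = [words[j] for j in order if j != i][:N]
        hits += sum(polarity[c] == polarity[w] for c in neighbors)
        total += len(neighbors)
    return hits / total                             # Equation 10

emb = {"good": np.array([0.9, 0.1]), "great": np.array([0.8, 0.2]),
       "bad": np.array([-0.7, 0.1]), "awful": np.array([-0.9, 0.2])}
pol = {"good": 1, "great": 1, "bad": -1, "awful": -1}
print(polarity_consistency(emb, pol, N=1))          # 1.0 on this toy lexicon
```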

Experiment Setup and Datasets.

We utilize the widely used sentiment lexicons MPQA [46] and HL [19] to evaluate the quality of word embedding. For each lexicon, we remove the words that do not appear in the embedding lookup table. We use only unigram embeddings in this section because these sentiment lexicons do not contain phrases. The statistics of the lexicons used in this paper are listed in Table 4.

Lexicon Positive Negative Total
HL 1,331 2,647 3,978
MPQA 1,932 2,817 4,749
Joint 1,051 2,024 3,075
Table 4: Statistics of the sentiment lexicons. Joint stands for the words that occur in both HL and MPQA with the same sentiment polarity.

Results.

Table 5 shows our results compared with other word embedding learning algorithms. The accuracy of the random baseline is 50%, since positive and negative words occur randomly among the nearest neighbors of each word. Sentiment-specific word embeddings (SSWEh, SSWEr, SSWEu) outperform existing neural models (C&W, word2vec) by large margins.

Embedding HL MPQA Joint
Random 50.00 50.00 50.00
C&W 63.10 58.13 62.58
Word2vec 66.22 60.72 65.59
ReEmb(C&W) 64.81 59.76 64.09
ReEmb(w2v) 67.16 61.81 66.39
WVSA 68.14 64.07 67.12
SSWEh 74.17 68.36 74.03
SSWEr 73.65 68.02 73.14
SSWEu 77.30 71.74 77.33
Table 5: Accuracy of the polarity consistency of words in different sentiment lexicons.

SSWEu performs best on all three lexicons, with SSWEh and SSWEr performing comparably. These results further demonstrate that sentiment-specific word embeddings capture the sentiment information of texts and distinguish words with opposite sentiment polarity, which traditional neural models like C&W and word2vec do not handle well. SSWE outperforms WVSA and ReEmb by exploiting more context information of words and more sentiment information of sentences, respectively.

5 Conclusion

In this paper, we propose learning continuous word representations as features for Twitter sentiment classification under a supervised learning framework. We show that word embeddings learned by traditional neural networks are not effective enough for Twitter sentiment classification: these methods typically model only the context information of words, so they cannot distinguish words with similar contexts but opposite sentiment polarity (e.g., good and bad). We learn sentiment-specific word embedding (SSWE) by integrating sentiment information into the loss functions of three neural networks, and train SSWE on massive distant-supervised tweets selected by positive and negative emoticons. The effectiveness of SSWE is implicitly evaluated by using it as features for sentiment classification on the benchmark dataset of SemEval 2013, and explicitly verified by measuring word similarity in the embedding space over sentiment lexicons. Our unified model, combining the syntactic context of words and the sentiment information of sentences, yields the best performance in both experiments.

Acknowledgments

We thank Yajuan Duan, Shujie Liu, Zhenghua Li, Li Dong, Hong Sun and Lanjun Zhou for their great help. This research was partly supported by National Natural Science Foundation of China (No.61133012, No.61273321, No.61300113).

References

  • [1] L. Barbosa and J. Feng (2010) Robust sentiment detection on Twitter from biased and noisy data. pp. 36–44.
  • [2] Y. Bengio, A. Courville and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [3] Y. Bengio, R. Ducharme, P. Vincent and C. Janvin (2003) A neural probabilistic language model. Journal of Machine Learning Research 3, pp. 1137–1155.
  • [4] Y. Bengio (2013) Deep learning of representations: looking forward. arXiv preprint arXiv:1305.0445.
  • [5] D. Bespalov, B. Bai, Y. Qi and A. Shokoufandeh (2011) Sentiment classification based on supervised latent n-gram analysis. pp. 375–382.
  • [6] D. Bespalov, Y. Qi, B. Bai and A. Shokoufandeh (2012) Sentiment classification with supervised sequence embedding. Machine Learning and Knowledge Discovery in Databases, pp. 159–174.
  • [7] D. M. Blei, A. Y. Ng and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3, pp. 993–1022.
  • [8] J. Blitzer, M. Dredze and F. Pereira (2007) Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. Vol. 7.
  • [9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537.
  • [10] D. Davidov, O. Tsur and A. Rappoport (2010) Enhanced sentiment learning using Twitter hashtags and smileys. pp. 241–249.
  • [11] X. Ding, B. Liu and P. S. Yu (2008) A holistic lexicon-based approach to opinion mining. pp. 231–240.
  • [12] J. Duchi, E. Hazan and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, pp. 2121–2159.
  • [13] R. Fan, K. Chang, C. Hsieh, X. Wang and C. Lin (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9, pp. 1871–1874.
  • [14] R. Feldman (2013) Techniques and applications for sentiment analysis. Communications of the ACM 56 (4), pp. 82–89.
  • [15] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan and N. A. Smith (2011) Part-of-speech tagging for Twitter: annotation, features, and experiments. pp. 42–47.
  • [16] X. Glorot, A. Bordes and Y. Bengio (2011) Domain adaptation for large-scale sentiment classification: a deep learning approach. Proceedings of the International Conference on Machine Learning.
  • [17] A. Go, R. Bhayani and L. Huang (2009) Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pp. 1–12.
  • [18] K. M. Hermann and P. Blunsom (2013) The role of syntax in vector space models of compositional semantics. pp. 894–904.
  • [19] M. Hu and B. Liu (2004) Mining and summarizing customer reviews. pp. 168–177.
  • [20] X. Hu, J. Tang, H. Gao and H. Liu (2013) Unsupervised sentiment analysis with emotional signals. pp. 607–618.
  • [21] L. Jiang, M. Yu, M. Zhou, X. Liu and T. Zhao (2011) Target-dependent Twitter sentiment classification. Proceedings of the Annual Meeting of the Association for Computational Linguistics 1, pp. 151–160.
  • [22] E. Kouloumpis, T. Wilson and J. Moore (2011) Twitter sentiment analysis: the good the bad and the OMG!.
  • [23] I. Labutov and H. Lipson (2013) Re-embedding words.
  • [24] B. Liu (2012) Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5 (1), pp. 1–167.
  • [25] K. Liu, W. Li and M. Guo (2012) Emoticon smoothed language models for Twitter sentiment analysis.
  • [26] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng and C. Potts (2011) Learning word vectors for sentiment analysis.
  • [27] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean (2013) Distributed representations of words and phrases and their compositionality. Conference on Neural Information Processing Systems.
  • [28] J. Mitchell and M. Lapata (2010) Composition in distributional models of semantics. Cognitive Science 34 (8), pp. 1388–1429.
  • [29] A. Mnih and G. E. Hinton (2009) A scalable hierarchical distributed language model. pp. 1081–1088.
  • [30] S. M. Mohammad, S. Kiritchenko and X. Zhu (2013) NRC-Canada: building the state-of-the-art in sentiment analysis of tweets. Proceedings of the International Workshop on Semantic Evaluation.
  • [31] P. Nakov, S. Rosenthal, Z. Kozareva, V. Stoyanov, A. Ritter and T. Wilson (2013) SemEval-2013 Task 2: sentiment analysis in Twitter. Vol. 13.
  • [32] A. Pak and P. Paroubek (2010) Twitter as a corpus for sentiment analysis and opinion mining. Vol. 2010.
  • [33] B. Pang, L. Lee and S. Vaithyanathan (2002) Thumbs up?: sentiment classification using machine learning techniques. pp. 79–86.
  • [34] B. Pang and L. Lee (2008) Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2 (1-2), pp. 1–135.
  • [35] R. Socher, J. Bauer, C. D. Manning and A. Y. Ng (2013) Parsing with compositional vector grammars. Annual Meeting of the Association for Computational Linguistics.
  • [36] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng and C. D. Manning (2011) Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Conference on Neural Information Processing Systems 24, pp. 801–809.
  • [37] R. Socher, B. Huval, C. D. Manning and A. Y. Ng (2012) Semantic compositionality through recursive matrix-vector spaces.
  • [38] R. Socher, C. C. Lin, A. Ng and C. Manning (2011) Parsing natural scenes and natural language with recursive neural networks. pp. 129–136.
  • [39] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng and C. D. Manning (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. pp. 151–161.
  • [40] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. pp. 1631–1642.
  • [41] M. Taboada, J. Brooke, M. Tofiloski, K. Voll and M. Stede (2011) Lexicon-based methods for sentiment analysis. Computational Linguistics 37 (2), pp. 267–307.
  • [42] M. Thelwall, K. Buckley and G. Paltoglou (2012) Sentiment strength detection for the social web. Journal of the American Society for Information Science and Technology 63 (1), pp. 163–173.
  • [43] J. Turian, L. Ratinov and Y. Bengio (2010) Word representations: a simple and general method for semi-supervised learning. Annual Meeting of the Association for Computational Linguistics.
  • [44] P. D. Turney (2002) Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. pp. 417–424.
  • [45] S. Wang and C. D. Manning (2012) Baselines and bigrams: simple, good sentiment and topic classification. pp. 90–94.
  • [46] T. Wilson, J. Wiebe and P. Hoffmann (2005) Recognizing contextual polarity in phrase-level sentiment analysis. pp. 347–354.
  • [47] A. Yessenalina and C. Cardie (2011) Compositional matrix-space models for sentiment analysis. pp. 172–182.
  • [48] J. Zhao, L. Dong, J. Wu and K. Xu (2012) MoodLens: an emoticon-based sentiment analysis system for Chinese tweets.
  • [49] X. Zheng, H. Chen and T. Xu (2013) Deep learning for Chinese word segmentation and POS tagging. pp. 647–657.