In this paper, we present a method that learns word embedding for Twitter sentiment classification. Most existing algorithms for learning continuous word representations model only the syntactic context of words and ignore the sentiment of text. This is problematic for sentiment analysis because such methods map words with similar syntactic contexts but opposite sentiment polarities, such as good and bad, to neighboring word vectors. We address this issue by learning sentiment-specific word embedding (SSWE), which encodes sentiment information in the continuous representation of words. Specifically, we develop three neural networks that effectively incorporate the supervision from the sentiment polarity of text (e.g. sentences or tweets) in their loss functions. To obtain large-scale training corpora, we learn the sentiment-specific word embedding from massive distant-supervised tweets collected by positive and negative emoticons. Experiments on applying SSWE to a benchmark Twitter sentiment classification dataset from SemEval 2013 show that (1) the SSWE feature performs comparably with the hand-crafted features of the top-performing system, and (2) the performance is further improved by concatenating SSWE with existing feature sets.
Twitter sentiment classification has attracted increasing research interest in recent years [21, 20]. The objective is to classify the sentiment polarity of a tweet as positive, negative or neutral. The majority of existing approaches follow Pang et al. [33] and employ machine learning algorithms to build classifiers from tweets with manually annotated sentiment polarity. Under this direction, most studies focus on designing effective features to obtain better classification performance. For example, Mohammad et al. (2013) build the top-performing system in the Twitter sentiment classification track of SemEval 2013 [31], using diverse sentiment lexicons and a variety of hand-crafted features.
Feature engineering is important but labor-intensive. It is therefore desirable to discover explanatory factors from the data and make the learning algorithms less dependent on extensive feature engineering [4]. For the task of sentiment classification, an effective feature learning method is to compose the representation of a sentence (or document) from the representations of the words or phrases it contains [40, 47]. Accordingly, it is a crucial step to learn the word representation (or word embedding), which is a dense, low-dimensional and real-valued vector for a word. Although existing word embedding learning algorithms [9, 27] are intuitive choices, they are not effective enough when used directly for sentiment classification. The most serious problem is that traditional methods typically model the syntactic context of words but ignore the sentiment information of text. As a result, words with opposite polarity, such as good and bad, are mapped to neighboring vectors. This is meaningful for some tasks such as POS tagging [49], since the two words have similar usages and grammatical roles, but it becomes a disaster for sentiment analysis, as they have opposite sentiment polarities.
In this paper, we propose learning sentiment-specific word embedding (SSWE) for sentiment analysis. We encode the sentiment information into the continuous representation of words, so that the model is able to separate good and bad to opposite ends of the spectrum. To this end, we extend the existing word embedding learning algorithm [9] and develop three neural networks that effectively incorporate the supervision from the sentiment polarity of text (e.g. sentences or tweets) in their loss functions. We learn the sentiment-specific word embedding from tweets, leveraging massive tweets with emoticons as distant-supervised corpora without any manual annotation. These automatically collected tweets contain noise, so they cannot be used directly as gold training data to build sentiment classifiers, but they are effective enough to provide weakly supervised signals for training the sentiment-specific word embedding.
We apply SSWE as features in a supervised learning framework for Twitter sentiment classification, and evaluate it on the benchmark dataset of SemEval 2013. In the task of predicting the positive/negative polarity of tweets, our method yields 84.89% in macro-F1 by using only SSWE as features, which is comparable to the top-performing system based on hand-crafted features (84.70%). After concatenating the SSWE feature with the existing feature set, we push the state-of-the-art to 86.58% in macro-F1. The quality of SSWE is also directly evaluated by measuring word similarity in the embedding space for sentiment lexicons. In terms of the accuracy of polarity consistency between each sentiment word and its top $T$ closest words, SSWE outperforms existing word embedding learning algorithms.
The major contributions of the work presented in this paper are as follows.
- We develop three neural networks to learn sentiment-specific word embedding (SSWE) from massive distant-supervised tweets without any manual annotation;
- To our knowledge, this is the first work that exploits word embedding for Twitter sentiment classification. We report results showing that the SSWE feature performs comparably with the hand-crafted features of the top-performing system in SemEval 2013;
- We release the sentiment-specific word embedding learned from 10 million tweets, which can be adopted off-the-shelf in other sentiment analysis tasks.
In this section, we present a brief review of the related work from two perspectives: Twitter sentiment classification, and learning continuous representations for sentiment classification.
Twitter sentiment classification, which identifies the sentiment polarity of short, informal tweets, has attracted increasing research interest [21, 20] in recent years. Generally, the methods employed in Twitter sentiment classification follow traditional sentiment classification approaches. The lexicon-based approaches [44, 11, 41, 42] mostly use a dictionary of sentiment words with their associated sentiment polarity, and incorporate negation and intensification to compute the sentiment polarity for each sentence (or document).
The learning-based methods for Twitter sentiment classification follow Pang et al. (2002)'s work, which treats sentiment classification of texts as a special case of text categorization. Many studies on Twitter sentiment classification [32, 10, 1, 22, 48] leverage massive noisy-labeled tweets selected by positive and negative emoticons as the training set and build sentiment classifiers directly, which is called distant supervision [17]. Instead of directly using the distant-supervised data as the training set, Liu et al. [25] adopt the tweets with emoticons to smooth the language model, and Hu et al. [20] incorporate the emotional signals into an unsupervised learning framework for Twitter sentiment classification.
Many existing learning-based methods for Twitter sentiment classification focus on feature engineering. The reason is that the performance of a sentiment classifier is heavily dependent on the choice of feature representation of tweets. The most representative system is introduced by Mohammad et al. [30], the state-of-the-art system (the top-performing system in the SemEval 2013 Twitter sentiment classification track) built from a number of hand-crafted features. Unlike these previous studies, we focus on learning discriminative features automatically from massive distant-supervised tweets.
Pang et al. (2002) pioneer this field by using a bag-of-words representation, representing each word as a one-hot vector. It has the same length as the size of the vocabulary, and only one dimension is 1, with all others being 0. Based on this representation, many feature learning algorithms have been proposed to obtain better classification performance [34, 24, 14]. However, the one-hot word representation cannot sufficiently capture the complex linguistic characteristics of words.
With the revival of interest in deep learning [2], incorporating the continuous representation of a word as features has proven effective in a variety of NLP tasks, such as parsing [35], language modeling [3, 29] and NER [43]. In the field of sentiment analysis, Bespalov et al. [5, 6] initialize the word embedding with Latent Semantic Analysis and further represent each document as a linear weighted combination of ngram vectors for sentiment classification. Yessenalina and Cardie [47] model each word as a matrix and combine words using iterated matrix multiplication. Glorot et al. [16] explore Stacked Denoising Autoencoders for domain adaptation in sentiment classification. Socher et al. propose the Recursive Neural Network (RNN) [38], matrix-vector RNN [37] and Recursive Neural Tensor Network (RNTN) [40] to learn the compositionality of phrases of any length based on the representation of each pair of children recursively. Hermann et al. [18] present Combinatory Categorial Autoencoders to learn the compositionality of sentences, which marries Combinatory Categorial Grammar with the Recursive Autoencoder.
The representation of words heavily relies on the applications or tasks in which it is used [23]. This paper focuses on learning sentiment-specific word embedding, which is tailored for sentiment analysis. Unlike Maas et al. (2011), who follow the probabilistic document model [7] and attach a sentiment predictor function to each word, we develop neural networks and map each ngram to the sentiment polarity of its sentence. Unlike Socher et al. (2011c), who utilize manually labeled texts to learn the meaning of phrases (or sentences) through compositionality, we focus on learning the meaning of words, namely word embedding, from massive distant-supervised tweets. Unlike Labutov and Lipson (2013), who produce task-specific embeddings from an existing word embedding, we learn sentiment-specific word embedding from scratch.
In this section, we present the details of learning sentiment-specific word embedding (SSWE) for Twitter sentiment classification. We propose incorporating the sentiment information of sentences to learn continuous representations for words and phrases. We extend the existing word embedding learning algorithm [9] and develop three neural networks to learn SSWE. In the following sections, we introduce the traditional method before presenting the details of SSWE learning algorithms. We then describe the use of SSWE in a supervised learning framework for Twitter sentiment classification.
Collobert et al. (2011) introduce the C&W model to learn word embedding based on the syntactic contexts of words. Given an ngram "cat chills on a mat", C&W replaces the center word with a random word $w^r$ and derives a corrupted ngram "cat chills $w^r$ a mat". The training objective is that the original ngram should obtain a higher language model score than the corrupted ngram by a margin of 1. The ranking objective function can be optimized by a hinge loss,
$$loss_{cw}(t, t^r) = \max\big(0,\; 1 - f^{cw}(t) + f^{cw}(t^r)\big) \qquad (1)$$

where $t$ is the original ngram, $t^r$ is the corrupted ngram, and $f^{cw}(\cdot)$ is a one-dimensional scalar representing the language model score of the input ngram.
Figure 1(a) illustrates the neural architecture of C&W, which consists of four layers, namely lookup → linear → hTanh → linear (from bottom to top). The original and corrupted ngrams are treated as inputs of the feed-forward neural network, respectively. The output is the language model score of the input, which is calculated as given in Equation 2, where $L$ is the lookup table of word embedding and $w_1$, $b_1$, $w_2$, $b_2$ are the parameters of the linear layers.
$$f^{cw}(t) = w_2\, a + b_2 \qquad (2)$$

$$a = hTanh\big(w_1\, L_t + b_1\big) \qquad (3)$$

$$hTanh(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases} \qquad (4)$$

where $L_t$ is the concatenation of the embeddings of the words in the input ngram.
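To make the model concrete, the following is a minimal numpy sketch of the C&W scoring function and ranking loss under the settings reported later in this paper (window size 3, embedding length 50, hidden length 20). The parameter names (`L`, `w1`, `b1`, `w2`, `b2`) and the toy word ids are our own illustration, not a released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, win, hidden = 10000, 50, 3, 20   # paper's settings

L  = rng.normal(scale=0.01, size=(vocab_size, emb_dim))    # lookup table of word embedding
w1 = rng.normal(scale=0.01, size=(hidden, win * emb_dim))  # linear layer
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.01, size=hidden)                   # scoring layer
b2 = 0.0

def htanh(x):
    # hard tanh activation, Equation 4
    return np.clip(x, -1.0, 1.0)

def f_cw(word_ids):
    """Language model score of an ngram (Equations 2-3)."""
    t = L[word_ids].reshape(-1)    # concatenate the window's word vectors
    a = htanh(w1 @ t + b1)         # hTanh layer
    return w2 @ a + b2             # one-dimensional score

def loss_cw(orig_ids, corrupted_ids):
    """Ranking hinge loss of Equation 1."""
    return max(0.0, 1.0 - f_cw(orig_ids) + f_cw(corrupted_ids))

orig = [3, 17, 256]        # toy ids for an original trigram
corr = [3, 4999, 256]      # center word replaced by a random word
print(loss_cw(orig, corr))
```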
Following the traditional C&W model [9], we incorporate the sentiment information into the neural network to learn sentiment-specific word embedding. We develop three neural networks (SSWE$_h$, SSWE$_r$ and SSWE$_u$) with different strategies to integrate the sentiment information of tweets.
As an unsupervised approach, the C&W model does not explicitly capture the sentiment information of texts. An intuitive solution for integrating the sentiment information is to predict the sentiment distribution of the text based on the input ngram. We do not utilize the entire sentence as input because sentences vary in length. We therefore slide a window of ngrams across a sentence, and predict the sentiment polarity based on each ngram with a shared neural network. In a neural network, the distributed representations of the higher layers are interpreted as features describing the input. Thus, we utilize the continuous vector of the top layer to predict the sentiment distribution of the text.
Assuming there are $K$ labels, we modify the dimension of the top layer in the C&W model to $K$ and add a softmax layer on top of it. The resulting neural network (SSWE$_h$) is given in Figure 1(b). A softmax layer is suitable for this scenario because its outputs are interpreted as conditional probabilities. Unlike C&W, SSWE$_h$ does not generate any corrupted ngram. Let $f^g(t)$ be the gold $K$-dimensional multinomial distribution of input $t$, where $K$ denotes the number of sentiment polarity labels and $\sum_k f^g_k(t) = 1$. For positive/negative classification, the distribution is of the form [1,0] for positive and [0,1] for negative. The cross-entropy error of the softmax layer is:
$$loss_h(t) = -\sum_{k=\{0,1\}} f^g_k(t) \cdot \log\big(f^h_k(t)\big) \qquad (5)$$

where $f^g(t)$ is the gold sentiment distribution and $f^h(t)$ is the predicted sentiment distribution.
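A hedged sketch of the SSWE$_h$ prediction and loss, continuing the numpy snippet above (it reuses `rng`, `L`, `w1`, `b1` and `htanh`); the names `w2h` and `b2h` are our own notation.

```python
K = 2                                       # positive / negative
w2h = rng.normal(scale=0.01, size=(K, hidden))   # top layer, now K-dimensional
b2h = np.zeros(K)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_h(word_ids):
    """Predicted sentiment distribution of an ngram (SSWE_h)."""
    a = htanh(w1 @ L[word_ids].reshape(-1) + b1)
    return softmax(w2h @ a + b2h)

def loss_h(word_ids, gold):
    """Cross-entropy of Equation 5; gold is [1, 0] (positive) or [0, 1]."""
    pred = predict_h(word_ids)
    return -float(np.sum(np.asarray(gold) * np.log(pred + 1e-12)))

print(loss_h([3, 17, 256], [1, 0]))
```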
SSWE$_h$ is trained by predicting the positive ngram as [1,0] and the negative ngram as [0,1]. However, this constraint is too strict. A distribution of [0.7,0.3] can also be interpreted as a positive label because the positive score is larger than the negative score. Similarly, a distribution of [0.2,0.8] indicates negative polarity. Based on this observation, the hard constraints in SSWE$_h$ can be relaxed: if the sentiment polarity of a tweet is positive, the predicted positive score is expected to be larger than the predicted negative score, and the exact reverse if the tweet has negative polarity.
We model the relaxed constraint with a ranking objective function and borrow the bottom four layers from SSWE$_h$, namely lookup → linear → hTanh → linear in Figure 1(b), to build the relaxed neural network (SSWE$_r$). Compared with SSWE$_h$, the softmax layer is removed because SSWE$_r$ does not require a probabilistic interpretation. The hinge loss of SSWE$_r$ is modeled as described below.
$$loss_r(t) = \max\big(0,\; 1 - \delta_s(t) f^r_0(t) + \delta_s(t) f^r_1(t)\big) \qquad (6)$$

where $f^r_0(t)$ is the predicted positive score, $f^r_1(t)$ is the predicted negative score, and $\delta_s(t)$ is an indicator function reflecting the sentiment polarity of the sentence,

$$\delta_s(t) = \begin{cases} 1 & \text{if } f^g(t) = [1, 0] \\ -1 & \text{if } f^g(t) = [0, 1] \end{cases} \qquad (7)$$
Similar to SSWE$_h$, SSWE$_r$ does not generate corrupted ngrams either.
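A sketch of SSWE$_r$ under the same assumptions as the snippets above: the softmax is dropped and the two raw scores are trained with the ranking loss of Equations 6-7; `w2r` and `b2r` are our own notation.

```python
w2r = rng.normal(scale=0.01, size=(2, hidden))
b2r = np.zeros(2)

def predict_r(word_ids):
    """Raw (positive, negative) scores without a probabilistic layer."""
    a = htanh(w1 @ L[word_ids].reshape(-1) + b1)
    return w2r @ a + b2r

def loss_r(word_ids, is_positive):
    """Relaxed hinge loss of Equation 6; delta implements Equation 7."""
    delta = 1.0 if is_positive else -1.0
    f0, f1 = predict_r(word_ids)       # predicted positive / negative score
    return max(0.0, 1.0 - delta * f0 + delta * f1)

print(loss_r([3, 17, 256], is_positive=True))
```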
The C&W model learns word embedding by modeling the syntactic contexts of words while ignoring sentiment information. By contrast, SSWE$_h$ and SSWE$_r$ learn sentiment-specific word embedding by integrating the sentiment polarity of sentences but leaving out the syntactic contexts of words. In this part, we develop a unified model (SSWE$_u$) that captures the sentiment information of sentences as well as the syntactic contexts of words. SSWE$_u$ is illustrated in Figure 1(c).
Given an original (or corrupted) ngram and the sentiment polarity of a sentence as input, SSWE$_u$ predicts a two-dimensional vector for each input ngram. The two scalars ($f^u_0$, $f^u_1$) stand for the language model score and the sentiment score of the input ngram, respectively. The training objectives of SSWE$_u$ are that (1) the original ngram should obtain a higher language model score $f^u_0(t)$ than the corrupted ngram $f^u_0(t^r)$, and (2) the sentiment score of the original ngram $f^u_1(t)$ should be more consistent with the gold polarity annotation of the sentence than that of the corrupted ngram $f^u_1(t^r)$. The loss function of SSWE$_u$ is a linear combination of two hinge losses,
$$loss_u(t, t^r) = \alpha \cdot loss_{cw}(t, t^r) + (1 - \alpha) \cdot loss_{us}(t, t^r) \qquad (8)$$

where $loss_{cw}(t, t^r)$ is the syntactic loss as given in Equation 1 and $loss_{us}(t, t^r)$ is the sentiment loss as described in Equation 9. The hyper-parameter $\alpha$ weighs the two parts.

$$loss_{us}(t, t^r) = \max\big(0,\; 1 - \delta_s(t) f^u_1(t) + \delta_s(t) f^u_1(t^r)\big) \qquad (9)$$
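A sketch of the unified loss, again continuing the numpy snippets above; `alpha`, `w2u` and `b2u` are our own names.

```python
w2u = rng.normal(scale=0.01, size=(2, hidden))
b2u = np.zeros(2)
alpha = 0.5                            # trade-off hyper-parameter of Eq. 8

def predict_u(word_ids):
    """[language model score f0, sentiment score f1] of an ngram."""
    a = htanh(w1 @ L[word_ids].reshape(-1) + b1)
    return w2u @ a + b2u

def loss_u(orig_ids, corrupted_ids, is_positive):
    """Equation 8: alpha * syntactic loss + (1 - alpha) * sentiment loss."""
    delta = 1.0 if is_positive else -1.0
    fo, fc = predict_u(orig_ids), predict_u(corrupted_ids)
    l_syntactic = max(0.0, 1.0 - fo[0] + fc[0])                   # Eq. 1
    l_sentiment = max(0.0, 1.0 - delta * fo[1] + delta * fc[1])   # Eq. 9
    return alpha * l_syntactic + (1.0 - alpha) * l_sentiment

print(loss_u([3, 17, 256], [3, 4999, 256], is_positive=True))
```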
We train sentiment-specific word embedding from massive distant-supervised tweets collected with positive and negative emoticons (we use the emoticons selected by Hu et al. [20]: the positive emoticons are :) : ) :-) :D =) and the negative emoticons are :( : ( :-( ). We crawl tweets from April 1st, 2013 to April 30th, 2013 with the Twitter API. We tokenize each tweet with TwitterNLP [15], remove @user mentions and URLs from each tweet, and filter out tweets that are too short (< 7 words). Finally, we collect 10M tweets: 5M tweets with positive emoticons and 5M tweets with negative emoticons.
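This collection step can be summarized with a small, hedged filter; the tokenization here is simplified to `str.split()`, whereas the paper tokenizes with TwitterNLP and removes @user mentions and URLs first.

```python
POSITIVE = (":)", ": )", ":-)", ":D", "=)")
NEGATIVE = (":(", ": (", ":-(")

def distant_label(tweet):
    """Assign a distant-supervision label from emoticons, or None."""
    if len(tweet.split()) < 7:
        return None                     # too short, discard
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:              # neither or both: ambiguous signal
        return None
    return "positive" if has_pos else "negative"

print(distant_label("so happy about the game tonight with friends :)"))
```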
We train SSWE$_h$, SSWE$_r$ and SSWE$_u$ by taking the derivative of the loss through back-propagation with respect to the whole set of parameters [9], and use AdaGrad [12] to update the parameters. We empirically set the window size to 3, the embedding length to 50, the length of the hidden layer to 20 and the learning rate of AdaGrad to 0.1 for all baselines and our models. We learn embeddings for unigrams, bigrams and trigrams separately with the same neural network and the same parameter settings. The contexts of a unigram (bigram/trigram) are the surrounding unigrams (bigrams/trigrams), respectively.
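For reference, a minimal AdaGrad update with the paper's learning rate of 0.1; `grad` stands for a gradient obtained by back-propagating one of the losses above with respect to a parameter such as the lookup table `L`.

```python
class AdaGrad:
    def __init__(self, shape, lr=0.1, eps=1e-8):
        self.cache = np.zeros(shape)   # accumulated squared gradients
        self.lr, self.eps = lr, eps

    def step(self, param, grad):
        # per-dimension learning rate shrinks as gradients accumulate
        self.cache += grad ** 2
        param -= self.lr * grad / (np.sqrt(self.cache) + self.eps)

opt_L = AdaGrad(L.shape)               # one optimizer per parameter
# opt_L.step(L, grad_L)                # called once per training ngram
```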
We apply sentiment-specific word embedding for Twitter sentiment classification under a supervised learning framework, as in previous work [33]. Instead of hand-crafting features, we incorporate the continuous representation of words and phrases as the features of a tweet. The sentiment classifier is built from tweets with manually annotated sentiment polarity.
We explore max, min and average convolutional layers [9, 36], which have been used as simple and effective methods for compositionality learning in vector-based semantics [28], to obtain the tweet representation. The result is the concatenation of the vectors derived from the different convolutional layers,

$$z(tw) = \big[z_{max}(tw),\; z_{min}(tw),\; z_{avg}(tw)\big]$$
where $z(tw)$ is the representation of tweet $tw$ and $z_c(tw)$ is the result of the convolutional layer $c \in \{max, min, avg\}$. Each convolutional layer employs the embeddings of unigrams, bigrams and trigrams separately and conducts the matrix-vector operation of $c$ on the sequence represented by columns in each lookup table. The output $z_c(tw)$ is the concatenation of the results obtained from the different lookup tables,

$$z_c(tw) = \big[conv_c(W^{uni}_{tw}),\; conv_c(W^{bi}_{tw}),\; conv_c(W^{tri}_{tw})\big]$$

where $conv_c$ is the convolutional function of layer $c$, and $W^{uni}_{tw}$, $W^{bi}_{tw}$ and $W^{tri}_{tw}$ are the concatenated column vectors of the words of the tweet in the lookup tables of the unigram, bigram and trigram embeddings, respectively.
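A sketch of this composition for a single lookup table; the paper applies the same operation to the unigram, bigram and trigram tables and concatenates all results.

```python
def compose(word_ids, table):
    """Tweet vector from max, min and average layers over one table."""
    vecs = table[word_ids]             # (n_tokens, emb_dim) matrix
    return np.concatenate([vecs.max(axis=0),    # conv_max
                           vecs.min(axis=0),    # conv_min
                           vecs.mean(axis=0)])  # conv_average

tweet_vec = compose([3, 17, 256, 42], L)   # 150-d for a 50-d embedding
```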
We conduct experiments to evaluate SSWE by incorporating it into a supervised learning framework for Twitter sentiment classification. We also directly evaluate the effectiveness of the SSWE by measuring the word similarity in the embedding space for sentiment lexicons.
We conduct experiments on the latest Twitter sentiment classification benchmark dataset from SemEval 2013 [31]. The training and development sets were released in full to task participants. However, we were unable to download all of the training and development sets because some tweets were deleted or unavailable due to modified authorization status. The test set was provided directly to the participants. The distribution of our dataset is given in Table 1. We train the sentiment classifier with LibLinear [13] on the training set, tune parameters on the dev set and evaluate on the test set. The evaluation metric is the macro-F1 of the positive and negative categories (we investigate 2-class Twitter sentiment classification (positive/negative) instead of the 3-class setting (positive/negative/neutral) of SemEval 2013).
| | Positive | Negative | Neutral | Total |
|---|---|---|---|---|
| Train | 2,642 | 994 | 3,436 | 7,072 |
| Dev | 408 | 219 | 493 | 1,120 |
| Test | 1,570 | 601 | 1,639 | 3,810 |

Table 1: Distribution of the SemEval 2013 Twitter sentiment classification dataset used in our experiments.
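As an illustration of this supervised step, here is a hedged sketch using scikit-learn's LinearSVC (a wrapper around the LibLinear library used in the paper); the random features and labels are placeholders for SSWE tweet vectors and gold polarities.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 150))      # placeholder SSWE tweet features
y_train = rng.integers(0, 2, 200)          # placeholder polarity labels
X_test  = rng.normal(size=(80, 150))
y_test  = rng.integers(0, 2, 80)

clf = LinearSVC(C=1.0)                     # C would be tuned on the dev set
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```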
We compare our method with the following sentiment classification algorithms:
(1) DistSuper: We use the 10 million tweets selected by positive and negative emoticons as training data, and build a sentiment classifier with LibLinear and ngram features [17].
(2) SVM: ngram features with a Support Vector Machine are a widely used baseline for building sentiment classifiers [33]. LibLinear is used to train the SVM classifier.
(3) NBSVM: NBSVM [45] is a state-of-the-art performer on many sentiment classification datasets, which trades off between Naive Bayes and an NB-enhanced SVM.
(4) RAE: Recursive Autoencoder [39] has been proven effective in many sentiment analysis tasks by learning compositionality automatically. We run RAE with randomly initialized word embedding.
(5) NRC: NRC built the top-performing system in the SemEval 2013 Twitter sentiment classification track, incorporating diverse sentiment lexicons and many manually designed features. We re-implement this system because its code is not publicly available (for 3-class sentiment classification in SemEval 2013, our re-implementation achieves 68.3%, 0.7% lower than NRC (69%) due to less training data). NRC-ngram refers to the feature set of NRC leaving out ngram features.
Except for DistSuper, the baseline methods are conducted in a supervised manner. We do not compare with RNTN [40] because we cannot efficiently train the RNTN model, as the tweets in our dataset do not have accurately parsed results or fine-grained sentiment labels for phrases. Another reason is that an RNTN model trained on movie reviews cannot be directly applied to tweets due to the differences between the domains [8].
Table 2 shows the macro-F1 of the baseline systems as well as the SSWE-based methods on positive/negative sentiment classification of tweets.
| Method | Macro-F1 |
|---|---|
| DistSuper + unigram | 61.74 |
| DistSuper + uni/bi/tri-gram | 63.84 |
| SVM + unigram | 74.50 |
| SVM + uni/bi/tri-gram | 75.06 |
| NBSVM | 75.28 |
| RAE | 75.12 |
| NRC (Top System in SemEval) | 84.73 |
| NRC - ngram | 84.17 |
| SSWE$_u$ | 84.98 |
| SSWE$_u$ + NRC | 86.58 |
| SSWE$_u$ + NRC-ngram | 86.48 |

Table 2: Macro-F1 of the baseline systems and the SSWE-based methods on positive/negative classification of tweets.
Distant supervision is relatively weak because the noisy-labeled tweets are treated as the gold standard, which hurts the performance of the classifier. The results of the bag-of-ngram (uni/bi/tri-gram) features are not satisfactory because the one-hot word representation cannot capture the latent connections between words. NBSVM and RAE perform comparably and lag well behind the NRC and SSWE-based methods. The reason is that RAE and NBSVM learn the representation of tweets from the small-scale manually annotated training set, which cannot capture the comprehensive linguistic phenomena of words.
NRC implements a variety of features and reaches 84.73% in macro-F1, verifying the importance of a better feature representation for Twitter sentiment classification. We achieve 84.98% by using only SSWE as features, without borrowing any sentiment lexicons or hand-crafted rules. The results indicate that SSWE automatically learns discriminative features from massive tweets and performs comparably with the state-of-the-art manually designed features. After concatenating SSWE with the feature set of NRC, the performance further improves to 86.58%. We also compare SSWE with the ngram features by integrating SSWE into NRC-ngram. The concatenated features SSWE+NRC-ngram (86.48%) outperform the original feature set of NRC (84.73%).
As a reference, we apply SSWE to subjectivity classification of tweets and obtain 72.17% in macro-F1 by using only SSWE as features. After combining SSWE with the feature set of NRC, we improve NRC from 74.86% to 75.39% on subjectivity classification.
We compare sentiment-specific word embeddings (SSWE$_h$, SSWE$_r$, SSWE$_u$) with baseline embedding learning algorithms by using only word embedding as features for Twitter sentiment classification. We use the embeddings of unigrams, bigrams and trigrams in this experiment. The embeddings of C&W [9], word2vec (available at https://code.google.com/p/word2vec/; we utilize the Skip-gram model because it performs better than CBOW in our experiments), WVSA [26] and our models are trained with the same dataset and the same parameter settings. We compare with C&W and word2vec as they have been proven effective in many NLP tasks. The trade-off parameter of ReEmb [23] is tuned on the development set of SemEval 2013.
Table 3 shows the performance on the positive/negative classification of tweets (WVSA and ReEmb are not suitable for learning bigram and trigram embeddings because their sentiment predictor functions only utilize the unigram embedding). ReEmb(C&W) and ReEmb(w2v) stand for the use of embeddings learned from 10 million distant-supervised tweets with C&W and word2vec, respectively. Each row of Table 3 represents a word embedding learning algorithm. Each column stands for a type of embedding used to compose the features of tweets. The column uni+bi denotes the use of unigram and bigram embeddings, and the column uni+bi+tri indicates the use of unigram, bigram and trigram embeddings.
| Embedding | unigram | uni+bi | uni+bi+tri |
|---|---|---|---|
| C&W | 74.89 | 75.24 | 75.89 |
| Word2vec | 73.21 | 75.07 | 76.31 |
| ReEmb(C&W) | 75.87 | – | – |
| ReEmb(w2v) | 75.21 | – | – |
| WVSA | 77.04 | – | – |
| SSWE$_h$ | 81.33 | 83.16 | 83.37 |
| SSWE$_r$ | 80.45 | 81.52 | 82.60 |
| SSWE$_u$ | 83.70 | 84.70 | 84.98 |

Table 3: Macro-F1 on positive/negative classification of tweets with different word embeddings as features.
From the first column of Table 3, we can see that the performance of C&W and word2vec is obviously lower than that of the sentiment-specific word embeddings when only unigram embeddings are used as features. The reason is that C&W and word2vec do not explicitly exploit the sentiment information of the text, so that words with opposite polarity, such as good and bad, are mapped to close word vectors. When such word embeddings are fed as features to a Twitter sentiment classifier, the discriminative ability of the sentiment words is weakened, and the classification performance suffers. The sentiment-specific word embeddings (SSWE$_h$, SSWE$_r$, SSWE$_u$) effectively distinguish words with opposite sentiment polarity and perform best in all three settings. The SSWE models outperform WVSA by exploiting more contextual information in the sentiment predictor function, and outperform ReEmb by leveraging more sentiment information from massive distant-supervised tweets. Among the three sentiment-specific word embeddings, SSWE$_u$ captures the most context information and yields the best performance; SSWE$_h$ and SSWE$_r$ obtain comparable results.
From each row of Table 3, we can see that the bigram and trigram embeddings consistently improve the performance of Twitter sentiment classification. The underlying reason is that a phrase, which cannot be accurately represented by unigram embeddings, is directly encoded into the ngram embedding as an idiomatic unit. A typical case in sentiment analysis is that a composed phrase or multiword expression may have a different sentiment polarity than the individual words it contains, such as not [bad] and [great] deal of (the word in brackets has a different sentiment polarity than the ngram). A very recent study by Mikolov et al. [27] also verified the effectiveness of phrase embeddings for analogical reasoning over phrases.
We tune the hyper-parameter $\alpha$ of SSWE$_u$ on the development set by using unigram embeddings as features. As given in Equation 8, $\alpha$ is the weight of the syntactic loss of SSWE$_u$ and trades off the syntactic and sentiment losses. SSWE$_u$ is trained from 10 million distant-supervised tweets.
Figure 2 shows the macro-F1 of SSWE$_u$ on positive/negative classification of tweets with different $\alpha$ on our development set. We can see that SSWE$_u$ performs better when $\alpha$ is in the range [0.5, 0.6], which balances the syntactic context and the sentiment information. The model with $\alpha=1$ is the C&W model, which only encodes the syntactic contexts of words. The sharp decline at $\alpha=1$ reflects the importance of sentiment information in learning word embedding for Twitter sentiment classification.
We investigate how the size of the distant-supervised data affects the performance of the SSWE$_u$ feature for Twitter sentiment classification. We vary the number of distant-supervised tweets from 1 million to 12 million, in increments of 1 million. We set the $\alpha$ of SSWE$_u$ to 0.5, according to the experiments shown in Figure 2. Results of positive/negative classification of tweets on our development set are given in Figure 3.
We can see that as more distant-supervised tweets are added, the performance of SSWE$_u$ consistently improves. The underlying reason is that with more tweets the word embedding is better estimated, as the vocabulary size is larger and the context and sentiment information are richer. With 10 million distant-supervised tweets, the SSWE$_u$ feature increases the macro-F1 of positive/negative classification of tweets to 82.94% on our development set. Beyond 10 million tweets, the performance remains stable, as the contexts of words have been mostly covered.
The quality of SSWE was implicitly evaluated through its application to Twitter sentiment classification in the previous subsection. We explicitly evaluate it in this section through word similarity in the embedding space for sentiment lexicons. The evaluation metric is the accuracy of polarity consistency between each sentiment word and its top $T$ closest words in the sentiment lexicon,
$$Accuracy = \frac{1}{N \times T} \sum_{i=1}^{N} \sum_{j=1}^{T} \mathbb{1}\big\{pol(w_i) = pol(c_{ij})\big\} \qquad (10)$$

where $N$ is the number of words in the sentiment lexicon, $w_i$ is the $i$-th word in the lexicon, $c_{ij}$ is the $j$-th closest word to $w_i$ in the lexicon under cosine similarity, and $\mathbb{1}\{\cdot\}$ is an indicator function that is equal to 1 if $w_i$ and $c_{ij}$ have the same sentiment polarity and 0 otherwise. Higher accuracy indicates better polarity consistency of words in the sentiment lexicon. We set $T$ to 100 in our experiment.
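A sketch of this metric under the stated definitions; `lexicon` (word → polarity) and `emb` (word → vector) are hypothetical inputs for illustration.

```python
import numpy as np

def polarity_consistency(lexicon, emb, T=100):
    """Accuracy of Equation 10 over a polarity lexicon and an embedding."""
    words = [w for w in lexicon if w in emb]
    M = np.stack([emb[w] for w in words])
    M /= np.linalg.norm(M, axis=1, keepdims=True)   # unit-normalize rows
    sims = M @ M.T                                  # cosine similarities
    np.fill_diagonal(sims, -np.inf)                 # exclude the word itself
    hits = 0
    for i, w in enumerate(words):
        top = np.argsort(-sims[i])[:T]              # T closest lexicon words
        hits += sum(lexicon[words[j]] == lexicon[w] for j in top)
    return hits / (len(words) * T)
```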
We utilize the widely-used sentiment lexicons, namely MPQA [46] and HL [19], to evaluate the quality of word embedding. For each lexicon, we remove the words that do not appear in the lookup table of word embedding. We only use unigram embedding in this section because these sentiment lexicons do not contain phrases. The distribution of the lexicons used in this paper is listed in Table 4.
| Lexicon | Positive | Negative | Total |
|---|---|---|---|
| HL | 1,331 | 2,647 | 3,978 |
| MPQA | 1,932 | 2,817 | 4,749 |
| Joint | 1,051 | 2,024 | 3,075 |

Table 4: Distribution of the sentiment lexicons used in this paper.
Table 5 shows our results compared with other word embedding learning algorithms. The accuracy of the random baseline is 50%, as positive and negative words occur randomly among the nearest neighbors of each word. The sentiment-specific word embeddings (SSWE$_h$, SSWE$_r$, SSWE$_u$) outperform existing neural models (C&W, word2vec) by large margins.
| Embedding | HL | MPQA | Joint |
|---|---|---|---|
| Random | 50.00 | 50.00 | 50.00 |
| C&W | 63.10 | 58.13 | 62.58 |
| Word2vec | 66.22 | 60.72 | 65.59 |
| ReEmb(C&W) | 64.81 | 59.76 | 64.09 |
| ReEmb(w2v) | 67.16 | 61.81 | 66.39 |
| WVSA | 68.14 | 64.07 | 67.12 |
| SSWE$_h$ | 74.17 | 68.36 | 74.03 |
| SSWE$_r$ | 73.65 | 68.02 | 73.14 |
| SSWE$_u$ | 77.30 | 71.74 | 77.33 |

Table 5: Accuracy of the polarity consistency of words in the sentiment lexicons.
SSWE$_u$ performs best on all three lexicons. SSWE$_h$ and SSWE$_r$ have comparable performance. The experimental results further demonstrate that the sentiment-specific word embeddings are able to capture the sentiment information of texts and distinguish words with opposite sentiment polarity, which is not well handled by traditional neural models like C&W and word2vec. SSWE outperforms WVSA and ReEmb by exploiting more context information of words and more sentiment information of sentences, respectively.
In this paper, we propose learning continuous word representations as features for Twitter sentiment classification under a supervised learning framework. We show that the word embeddings learned by traditional neural networks are not effective enough for Twitter sentiment classification: these methods typically model only the context information of words, so they cannot distinguish words with similar contexts but opposite sentiment polarity (e.g. good and bad). We learn sentiment-specific word embedding (SSWE) by integrating sentiment information into the loss functions of three neural networks. We train SSWE with massive distant-supervised tweets selected by positive and negative emoticons. The effectiveness of SSWE has been implicitly evaluated by using it as features for sentiment classification on the benchmark dataset of SemEval 2013, and explicitly verified by measuring word similarity in the embedding space for sentiment lexicons. Our unified model, which combines the syntactic context of words and the sentiment information of sentences, yields the best performance in both experiments.
We thank Yajuan Duan, Shujie Liu, Zhenghua Li, Li Dong, Hong Sun and Lanjun Zhou for their great help. This research was partly supported by National Natural Science Foundation of China (No.61133012, No.61273321, No.61300113).