Building Sentiment Lexicons for All Major Languages

Yanqing Chen
Computer Science Dept.
Stony Brook University
Stony Brook, NY 11794
cyanqing@cs.stonybrook.edu
   Steven Skiena
Computer Science Dept.
Stony Brook University
Stony Brook, NY 11794
skiena@cs.stonybrook.edu
March 2014
Abstract

Sentiment analysis in a multilingual world remains a challenging problem, because developing language-specific sentiment lexicons is an extremely resource-intensive process. Such lexicons remain a scarce resource for most languages.

In this paper, we address this lexicon gap by building high-quality sentiment lexicons for 136 major languages. We integrate a variety of linguistic resources to produce an immense knowledge graph. By appropriately propagating from seed words, we construct sentiment lexicons for each component language of our graph. Our lexicons have a polarity agreement of 95.7% with published lexicons, while achieving an overall coverage of 45.2%.

We demonstrate the performance of our lexicons in an extrinsic analysis of the Wikipedia articles of 2,000 distinct historical figures across 30 languages. Despite cultural differences and the intended neutrality of Wikipedia articles, our lexicons show an average sentiment correlation of 0.28 across all language pairs.

1 Introduction

Sentiment analysis of English texts has become a large and active research area, with many commercial applications, but the barrier of language limits the ability to assess the sentiment of most of the world’s population.

Although several well-regarded sentiment lexicons are available in English [9, 17], the same is not true for most of the world’s languages. Indeed, our literature search identified only 12 publicly available sentiment lexicons, covering just 5 non-English languages (Mandarin Chinese, German, Arabic, Japanese, and Italian). No doubt we missed some, but it is clear that such resources are not widely available for most important languages.

In this paper, we strive to produce a comprehensive set of sentiment lexicons for the world’s major languages. We make the following contributions:

  • New Sentiment Analysis Resources – We have generated sentiment lexicons for 136 major languages via graph propagation, now publicly available at https://sites.google.com/site/datascienceslab/projects/. We validate our lexicons against other publicly available, human-annotated sentiment lexicons. Indeed, our lexicons have a polarity agreement of 95.7% with these published lexicons, and an overall coverage of 45.2%.

  • Large-Scale Language Knowledge Graph Analysis – We have created a massive knowledge graph of 7 million vocabulary words from 136 languages with over 131 million semantic inter-language links, which proves valuable for aligning word meanings across languages.

  • Extrinsic Evaluation – We elucidate the sentiment consistency of entities reported in different language editions of Wikipedia using our propagated lexicons. In particular, we pick 30 languages and compute sentiment scores for 2,000 distinct historical figures. Each language pair exhibits a Spearman sentiment correlation of at least 0.14, with an average correlation of 0.28 over all pairs.

The rest of this paper is organized as follows. We review related work in Section 2. In Section 3, we describe our resource processing and design decisions. Section 4 discusses graph propagation methods to identify sentiment polarity across languages. Section 5 evaluates our results against each available human-annotated lexicon. Finally, in Section 6 we present our extrinsic evaluation of sentiment consistency in Wikipedia prior to our conclusions.

2 Related Work

Sentiment analysis is an important area of NLP with a large and growing literature. Excellent surveys of the field include [18, 21], establishing that rich online resources have greatly expanded opportunities for opinion mining and sentiment analysis. Godbole et al. (2007) build an English lexicon-based sentiment analysis system to evaluate the general reputation of entities. Taboada et al. (2011) present a more sophisticated model that considers patterns, including negation and repetition, using adjusted weights. Liu (2010) introduces an efficient, state-of-the-art method for sentiment analysis and subjectivity detection in English.

Researchers have investigated topic- or domain-dependent approaches to identify opinions. Jijkoun et al. (2010) focus on generating topic-specific sentiment lexicons. Li et al. (2010) extract sentiment with global and local topic dependency. Gindl et al. (2010) perform sentiment analysis using cross-domain contextualization, and Pak and Paroubek (2010) focus on Twitter, studying the colloquial register of English found there.

Work has been done to generalize sentiment analysis to other languages. Denecke (2008) performs multilingual sentiment analysis using SentiWordNet. Mihalcea et al. (2007) learn multilingual subjectivity via cross-lingual projections. Abbasi et al. (2008) extract language-specific features of Arabic, which requires language-specific knowledge. Gînscă et al. (2011) develop an improved sentiment analysis system for Romanian.

The ready availability of machine translation to and from English has prompted efforts to employ translation for sentiment analysis [6]. Banea et al. (2008) demonstrate that machine translation can perform quite well when extending subjectivity analysis to a multilingual environment, which motivates applying a similar strategy to lexicon-based sentiment analysis.

Machine learning approaches to sentiment analysis are attractive because of the promise of reduced manual processing. Boiy and Moens (2009) conduct machine learning sentiment analysis using multilingual web texts. Deep learning approaches build on distributed word embeddings, which offer concise features reflecting the semantics of the underlying vocabulary. Turian et al. (2010) create powerful word embeddings by training on real and corrupted phrases, optimizing for the replaceability of words. Zou et al. (2013) combine machine translation and word representations to generate bilingual language resources. Socher et al. (2012) demonstrate a powerful approach to English sentiment using word embeddings, which can easily be extended to other languages by training on appropriate text corpora.

3 Knowledge Graph Construction

In this section we describe how we leverage a variety of NLP resources to construct the semantic connection graph we use to propagate sentiment lexicons.

Figure 1: Illustration of our knowledge graph, showing links between words and the edge representation that preserves source identity. For each edge between corresponding words, a 5-bit integer records which of 5 possible semantic link types are present.

The Polyglot project [3] identified the 100,000 most frequently used words in each language’s Wikipedia. Drawing a candidate lexicon from Wikipedia has some downsides (e.g. limited use of informal words), but it is representative and convenient across a large number of languages. In particular, we collect a total of 7,741,544 high-frequency words from 136 languages to serve as vertices in our graph.

We seek to identify as many semantic links across languages as possible to connect our network, and so integrated several resources:

  • Wiktionary – This growing resource has entries for 171 languages, edited by contributors with relevant language knowledge. Wiktionary provides about 19.7% of the total links, covering 382,754 vertices in our graph.

  • Machine Translation – We script the Google translation API to obtain additional semantic links. In particular, we request translations of each word in our English vocabulary into the 57 languages with available translators, as well as translations of each known vocabulary word in other languages back into English. In total, machine translation provides 53.2% of the total links and establishes connections between 3.5 million vertices.

  • Transliteration Links – Borrowing carries words across languages with little morphological change, and closely related language pairs (e.g. Russian and Ukrainian) share many characters and words. Though not always the case, words with the same spelling usually have similar meanings, so these links improve the coverage of semantic links. Transliteration provides 22.1% of the total links in our experiment.

  • WordNet – Finally, we gather synonyms and antonyms of English words from WordNet, which prove particularly useful in propagating sentiment across languages. In total we collect over 100,000 pairs of synonyms and antonyms, creating 5.0% of the total links.

Links do not always agree in both directions, particularly for multi-sense words; thus all links in our network are unidirectional. Figure 1 illustrates how we encode links from different resources in an integer edge value.
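
To make the edge representation concrete, the Python sketch below encodes which of the 5 link types connect a word pair as bits of a small integer. The assignment of link types to bit positions and the add_link/has_link helpers are our own illustrative assumptions; the paper only specifies that a 5-bit integer records the link types present on an edge.

    # Illustrative sketch: link-type bitmask per directed edge.
    from collections import defaultdict

    WIKTIONARY      = 1 << 0   # bit assignment is an assumption for illustration
    TRANSLATION     = 1 << 1
    TRANSLITERATION = 1 << 2
    WORDNET_SYN     = 1 << 3
    WORDNET_ANT     = 1 << 4

    edges = defaultdict(int)   # (source word, target word) -> link-type bitmask

    def add_link(src, dst, link_type):
        # Directed link; evidence from multiple resources is merged by OR-ing bits.
        edges[(src, dst)] |= link_type

    def has_link(src, dst, link_type):
        return bool(edges[(src, dst)] & link_type)

    add_link(("en", "good"), ("de", "gut"), TRANSLATION)
    add_link(("en", "good"), ("de", "gut"), WIKTIONARY)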

4 Graph Propagation

Sentiment propagation starts from English sentiment lexicons. Through the semantic links in our knowledge graph, words extend their sentiment polarities to adjacent neighbors. We experimented with both a graph propagation algorithm [28] and a label propagation algorithm [29, 22]. The primary difference between them is that label propagation takes multiple paths between two vertices into consideration, while graph propagation uses only the best path between word pairs.

We report results from using Liu’s lexicons [17] as seed words. Liu’s lexicons contain 2,006 positive words and 4,783 negative words. Of these, 1,422 positive words and 2,956 negative words (roughly 64.5%) appear among the 100,000 English vertices in our graph.
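
To illustrate the graph propagation variant, the following Python sketch spreads seed polarities along the single best (highest-weight) path to each word, in the spirit of [28]. The hop limit, multiplicative path weights, and sign flip on antonym edges are simplifying assumptions of this sketch rather than the exact procedure used in the paper.

    # Minimal best-path propagation sketch. graph maps each word to
    # (neighbor, weight, is_antonym) triples with weights in (0, 1];
    # seeds map words to +1.0 (positive) or -1.0 (negative).
    import heapq

    def propagate(graph, seeds, max_hops=4):
        best = dict(seeds)                  # word -> signed score of best path so far
        hops = {w: 0 for w in seeds}
        heap = [(-1.0, w) for w in seeds]   # max-heap keyed on path weight
        while heap:
            neg_wt, word = heapq.heappop(heap)
            path_wt = -neg_wt
            if path_wt < abs(best[word]) or hops[word] >= max_hops:
                continue                    # stale entry or hop limit reached
            sign = 1.0 if best[word] > 0 else -1.0
            for nbr, wt, is_antonym in graph.get(word, []):
                new_wt = path_wt * wt
                new_sign = -sign if is_antonym else sign
                if new_wt > abs(best.get(nbr, 0.0)):
                    best[nbr] = new_sign * new_wt
                    hops[nbr] = hops[word] + 1
                    heapq.heappush(heap, (-new_wt, nbr))
        return best

    # Toy example: the seed "good" makes "gut" positive and "schlecht" negative.
    graph = {"good": [("gut", 0.9, False), ("schlecht", 0.8, True)]}
    print(propagate(graph, {"good": 1.0}))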

 Dataset    Propagation  Acc   Cov
 Arabic     Label        0.93  0.45
            Graph        0.94  0.46
 German     Label        0.97  0.31
            Graph        0.97  0.32
 English    Label        0.92  0.55
            Graph        0.90  0.69
 Italian    Label        0.73  0.29
            Graph        0.72  0.32
 Japanese   Label        0.57  0.12
            Graph        0.56  0.15
 Chinese-1  Label        0.95  0.62
            Graph        0.94  0.65
 Chinese-2  Label        0.97  0.70
            Graph        0.97  0.72
Table 1: Graph propagation vs. label propagation. Acc represents the ratio of identical polarity between our analysis and the published lexicons. Cov reflects what fraction of our lexicons overlaps with the published lexicons.
 Language  Lexicon size  Pos. ratio  Language  Lexicon size  Pos. ratio  Language  Lexicon size  Pos. ratio
 Afrikaans 2299 0.40 Albanian 2076 0.41 Amharic 46 0.63
 Arabic 2794 0.41 Aragonese 97 0.47 Armenian 1657 0.43
 Assamese 493 0.49 Azerbaijani 1979 0.41 Bashkir 19 0.63
 Basque 1979 0.40 Belarusian 1526 0.43 Bengali 2393 0.42
 Bosnian 2020 0.42 Breton 184 0.42 Bulgarian 2847 0.40
 Burmese 461 0.48 Catalan 3204 0.37 Cebuano 56 0.54
 Chechen 26 0.65 Chinese 3828 0.34 Chuvash 17 0.76
 Croatian 2208 0.40 Czech 2599 0.41 Danish 3340 0.38
 Divehi 67 0.67 Dutch 3976 0.38 English 4376 0.32
 Esperanto 2604 0.40 Estonian 2105 0.41 Faroese 123 0.43
 Finnish 3295 0.40 French 4653 0.35 Frisian 224 0.43
 Gaelic 345 0.50 Galician 2714 0.37 German 3974 0.38
 Georgian 2202 0.40 Greek 2703 0.39 Gujarati 2145 0.44
 Haitian 472 0.44 Hebrew 2533 0.36 Hindi 3640 0.39
 Hungarian 3522 0.38 Icelandic 1770 0.40 Ido 183 0.49
 Interlingua 326 0.50 Indonesian 2900 0.37 Italian 4491 0.36
 Irish 1073 0.45 Japanese 1017 0.39 Javanese 168 0.51
 Kazakh 81 0.65 Kannada 2173 0.42 Kirghiz 246 0.49
 Khmer 956 0.49 Korean 2118 0.42 Kurdish 145 0.48
 Latin 2033 0.46 Latvian 1938 0.42 Limburgish 93 0.46
 Lithuanian 2190 0.41 Luxembourg 224 0.52 Macedonian 2965 0.39
 Malagasy 48 0.54 Malayalam 393 0.50 Malay 2934 0.39
 Maltese 863 0.50 Marathi 1825 0.48 Manx 90 0.51
 Mongolian 130 0.52 Nepali 504 0.49 Norwegian 3089 0.37
 Nynorsk 1894 0.39 Occitan 429 0.40 Oriya 360 0.51
 Ossetic 12 0.67 Panjabi 79 0.63 Pashto 198 0.50
 Persian 2477 0.39 Polish 3533 0.39 Portuguese 3953 0.35
 Quechua 47 0.55 Romansh 116 0.48 Romanian 3329 0.39
 Russian 2914 0.43 Sanskrit 178 0.59 Sami 24 0.71
 Serbian 2034 0.41 Sinhala 1122 0.43 Slovak 2428 0.43
 Slovene 2244 0.42 Spanish 4275 0.36 Sundanese 476 0.50
 Swahili 1314 0.42 Swedish 3722 0.39 Tamil 2057 0.40
 Tagalog 1858 0.44 Tajik 97 0.62 Tatar 76 0.50
 Telugu 2523 0.41 Thai 1279 0.51 Tibetan 24 0.63
 Turkmen 78 0.56 Turkish 2500 0.39 Uighur 18 0.44
 Ukrainian 2827 0.41 Urdu 1347 0.39 Uzbek 111 0.57
 Vietnamese 1016 0.38 Volapuk 43 0.70 Walloon 193 0.32
 Welsh 1647 0.42 Yiddish 395 0.43 Yoruba 276 0.50
Table 2: Sentiment lexicon statistics. We tag the 10 languages with the most/fewest sentiment words in blue/green, and the 10 languages with the highest/lowest ratio of positive words in orange/purple.

Our knowledge network comprises links from a heterogeneous collection of sources, with differing coverage and reliability. For the task of deciding the sentiment polarity of words, only antonym links are negative. An edge gains zero weight if both negative and positive links exist. Edges with multiple positive links are credited the highest weight among those links. We conducted a grid search on the weight of each type of link to maximize overall accuracy on our test data of published non-English sentiment lexicons. To avoid potential overfitting, the grid search is seeded with the SentiWordNet English lexicon [9] instead of Liu’s.
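
This weighting rule can be summarized by the following Python sketch. The numeric per-type weights are hypothetical placeholders standing in for the grid-searched values, and the bit constants repeat the illustrative encoding from Section 3.

    # Sketch of combining link-type evidence on one edge into a single weight.
    WIKTIONARY, TRANSLATION, TRANSLITERATION, WORDNET_SYN, WORDNET_ANT = (
        1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4)

    POSITIVE_WEIGHTS = {WIKTIONARY: 0.9, TRANSLATION: 0.8,     # placeholder values;
                        TRANSLITERATION: 0.6, WORDNET_SYN: 1.0}  # tuned by grid search
    ANTONYM_WEIGHT = 1.0   # magnitude of the (sign-flipping) antonym link

    def edge_weight(bitmask):
        positives = [w for t, w in POSITIVE_WEIGHTS.items() if bitmask & t]
        antonym = bool(bitmask & WORDNET_ANT)
        if positives and antonym:
            return 0.0                       # conflicting evidence cancels out
        if antonym:
            return -ANTONYM_WEIGHT           # negative weight flips polarity
        return max(positives) if positives else 0.0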

5 Lexicon Evaluation

We collected all available published sentiment lexicons for non-English languages to serve as gold standards for our evaluation, covering Arabic, Italian, German, Japanese, and Chinese. Coupled with English sentiment lexicons, this provides seven test cases in total, specifically:

  • Arabic: [2].

  • German: [23].

  • English: [9].

  • Italian: [5].

  • Japanese: [15].

  • Chinese-1, Chinese-2: [13].

 Type  Person             Z-score distribution
 Good  Leonardo da Vinci  [distribution plot]
       Steven Spielberg   [distribution plot]
 Bad   Adolf Hitler       [distribution plot]
       Osama bin Laden    [distribution plot]
Table 3: Z-score distribution examples. We label 10 languages with their language codes and the others with tick marks on the x-axis.

We present the accuracy and coverage achieved by the two propagation models in Table 1. Both models achieve similar accuracy, though graph propagation yields slightly more words that can be verified via published lexicons. Performance on Japanese is weaker because of a mismatch between our dictionary and the test data.

Table 2 reveals that very sparse sentiment lexicons resulted for a small but notable fraction of the languages we analyzed. In particular, 20 languages yielded lexicons of fewer than 100 words. Without exception, these languages have very few available definitions in Wiktionary. By contrast, 48 languages had lexicons with over 2,000 words, and another 16 had between 1,000 and 2,000: clearly large enough to perform a meaningful analysis.

6 Extrinsic Evaluation: Consistency of Wikipedia Sentiment

We evaluate our lexicons by measuring the consistency of the Wikipedia pages about a particular person across languages. As our candidate entities for analysis, we use the Wikipedia pages of the 2,000 most significant people as measured in the recent book Who’s Bigger? [24]. The sentiment polarity of a page is computed by subtracting the number of negative words from the number of positive words, then dividing by the sum of both.
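
A straightforward implementation of this page-level score, assuming the article has already been tokenized, is sketched below in Python; returning 0.0 when a page contains no lexicon hits is our own guard.

    def page_polarity(tokens, positive_words, negative_words):
        # (#positive hits - #negative hits) / (#positive hits + #negative hits)
        pos = sum(1 for t in tokens if t in positive_words)
        neg = sum(1 for t in tokens if t in negative_words)
        return (pos - neg) / (pos + neg) if pos + neg else 0.0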

The differing ratios of positive and negative polarity terms in Table 2 mean that sentiment scores cannot be directly compared across languages. For a more consistent evaluation, we compute the z-score of each entity against the distribution of all entities in its language.
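
The per-language normalization can be sketched as follows; treating a language with zero variance as all-zero z-scores is an assumption of this sketch, not something specified in the paper.

    import statistics

    def zscores(polarity_by_person):
        # polarity_by_person: {person: raw polarity score in one language}
        values = list(polarity_by_person.values())
        mu, sigma = statistics.mean(values), statistics.pstdev(values)
        if sigma == 0:
            return {p: 0.0 for p in polarity_by_person}
        return {p: (s - mu) / sigma for p, s in polarity_by_person.items()}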

We use the Spearman correlation coefficient to measure the consistency of the sentiment distribution across all entities with pages in a particular language pair. Figure 2 shows the results for the 30 languages with the largest propagated sentiment lexicons. All pairs of languages exhibit positive correlation (and hence generally stable and consistent sentiment), with an average correlation of 0.28.
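
For a given language pair, the correlation is computed over the entities that have articles in both languages; a minimal sketch using scipy.stats.spearmanr is given below, with the restriction to shared entities being our own assumption.

    from scipy.stats import spearmanr

    def sentiment_correlation(z_a, z_b):
        # z_a, z_b: {person: z-score} for two languages; compare shared entities only.
        shared = sorted(set(z_a) & set(z_b))
        rho, _pvalue = spearmanr([z_a[p] for p in shared],
                                 [z_b[p] for p in shared])
        return rho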

Figure 2: Heatmap of sentiment correlation between 30 languages.

Finally, Table 3 illustrates sentiment consistency over all 136 languages (represented by blue tick marks), with the first 10 languages of Figure 2 labeled explicitly. Respected artists like Steven Spielberg and Leonardo da Vinci show consistently positive sentiment, while notorious figures like Osama bin Laden and Adolf Hitler are consistently negative.

7 Conclusions

Our knowledge graph propagation is generally effective at producing useful sentiment lexicons. Interestingly, the ratio of positive sentiment words is strongly connected with the number of sentiment words; it is noteworthy that English has the smallest ratio of positive lexicon terms. This phenomenon possibly shows that many negative words reflecting cultural nuances do not translate well. We believe that this ratio can serve as a quality measure of the propagation. Similar approaches can be extended to other NLP tasks using different semantic links, specialized dictionaries, and task-specific seed words. Future work will revolve around learning modifiers, negation terms, and entity-level sentiment attribution.

Acknowledgments

This research was partially supported by NSF Grants DBI-1060572 and IIS-1017181, and a Google Faculty Research Award.

References

  • [1] A. Abbasi, H. Chen and A. Salem(2008) Sentiment analysis in multiple languages: feature selection for opinion classification in web forums. ACM Transactions on Information Systems (TOIS) 26 (3), pp. 12. Cited by: 2.
  • [2] M. Abdul-Mageed, M. T. Diab and M. Korayem(2011) Subjectivity and sentiment analysis of modern standard arabic. pp. 587–591. Cited by: 5.
  • [3] R. Al-Rfou, B. Perozzi and S. Skiena(2013) Polyglot: distributed word representations for multilingual nlp. arXiv preprint arXiv:1307.1662. Cited by: 3.
  • [4] C. Banea, R. Mihalcea, J. Wiebe and S. Hassan(2008) Multilingual subjectivity analysis using machine translation. pp. 127–135. Cited by: 2.
  • [5] V. Basile and M. Nissim(2013) Sentiment analysis on italian tweets. WASSA 2013, pp. 100. Cited by: 5.
  • [6] M. Bautin, L. Vijayarenu and S. Skiena(2008) International sentiment analysis for news and blogs. Note: Second Int. Conf. on Weblogs and Social Media (ICWSM 2008) Cited by: 2.
  • [7] E. Boiy and M. Moens(2009) A machine learning approach to sentiment analysis in multilingual web texts. Information retrieval 12 (5), pp. 526–558. Cited by: 2.
  • [8] K. Denecke(2008) Using sentiwordnet for multilingual sentiment analysis. pp. 507–512. Cited by: 2.
  • [9] A. Esuli and F. Sebastiani(2006) Sentiwordnet: a publicly available lexical resource for opinion mining. Vol. 6, pp. 417–422. Cited by: 5, 1, 4.
  • [10] S. Gindl, A. Weichselbraun and A. Scharl(2010) Cross-domain contextualisation of sentiment lexicons. 19th European Conference on Artificial Intelligence (ECAI). Cited by: 2.
  • [11] N. Godbole, M. Srinivasaiah and S. Skiena(2007) Large-scale sentiment analysis for news and blogs.. ICWSM 7. Cited by: 2.
  • [12] A. Gînscă, E. Boroş, A. Iftene, D. TrandabĂţ, M. Toader, M. Corîci, C. Perez and D. Cristea(2011) Sentimatrix: multilingual sentiment analysis service. pp. 189–195. Cited by: 2.
  • [13] Y. He, H. Alani and D. Zhou(2010) Exploring english lexicon knowledge for chinese sentiment analysis. CIPS-SIGHAN Joint Conference on Chinese Language Processing. Cited by: 5.
  • [14] V. Jijkoun, M. de Rijke and W. Weerkamp(2010) Generating focused topic-specific sentiment lexicons. pp. 585–594. Cited by: 2.
  • [15] N. Kaji and M. Kitsuregawa(2007) Building lexicon for sentiment analysis from massive collection of html documents.. pp. 1075–1083. Cited by: 5.
  • [16] F. Li, M. Huang and X. Zhu(2010) Sentiment analysis with global topics and local dependency.. Cited by: 2.
  • [17] B. Liu(2010) Sentiment analysis and subjectivity. Handbook of natural language processing 2, pp. 568. Cited by: 1, 2, 4.
  • [18] B. Liu(2013) Sentiment analysis and opinion mining. Morgan and Claypool. Cited by: 2.
  • [19] R. Mihalcea, C. Banea and J. Wiebe(2007) Learning multilingual subjective language via cross-lingual projections. Vol. 45, pp. 976. Cited by: 2.
  • [20] A. Pak and P. Paroubek(2010) Twitter as a corpus for sentiment analysis and opinion mining.. Cited by: 2.
  • [21] B. Pang and L. Lee(2008) Opinion mining and sentiment analysis. Foundations and trends in information retrieval 2 (1-2), pp. 1–135. Cited by: 2.
  • [22] D. Rao and D. Ravichandran(2009) Semi-supervised polarity lexicon induction. pp. 675–682. Cited by: 4.
  • [23] R. Remus, U. Quasthoff and G. Heyer(2010) SentiWS-a publicly available german-language resource for sentiment analysis.. Cited by: 5.
  • [24] S. Skiena and C. Ward(2013) Who’s bigger?: where historical figures really rank. Cambridge University Press. Cited by: 6.
  • [25] R. Socher, B. Huval, C. D. Manning and A. Y. Ng(2012) Semantic compositionality through recursive matrix-vector spaces. Cited by: 2.
  • [26] M. Taboada, J. Brooke, M. Tofiloski, K. Voll and M. Stede(2011) Lexicon-based methods for sentiment analysis. Computational linguistics 37 (2), pp. 267–307. Cited by: 2.
  • [27] J. Turian, L. Ratinov and Y. Bengio(2010) Word representations: a simple and general method for semi-supervised learning. pp. 384–394. Cited by: 2.
  • [28] L. Velikovich, S. Blair-Goldensohn, K. Hannan and R. McDonald(2010) The viability of web-derived polarity lexicons. pp. 777–785. Cited by: 4.
  • [29] X. Zhu and Z. Ghahramani(2002) Learning from labeled and unlabeled data with label propagation. Technical report Technical Report CMU-CALD-02-107, Carnegie Mellon University. Cited by: 4.
  • [30] W. Y. Zou, R. Socher, D. Cer and C. D. Manning(2013) Bilingual word embeddings for phrase-based machine translation. pp. 1393–1398. Cited by: 2.