We investigate the novel task of online dispute detection and propose a sentiment analysis solution to the problem: we aim to identify the sequence of sentence-level sentiments expressed during a discussion and to use them as features in a classifier that predicts the DISPUTE/NON-DISPUTE label for the discussion as a whole. We evaluate dispute detection approaches on a newly created corpus of Wikipedia Talk page disputes and find that classifiers that rely on our sentiment tagging features outperform those that do not. The best model achieves a very promising F1 score of 0.78 and an accuracy of 0.80.
As the web has grown in popularity and scope, so has the promise of collaborative information environments for the joint creation and exchange of knowledge [11, 18]. Wikipedia, a wiki-based online encyclopedia, is arguably the best example: its distributed editing environment allows readers to collaborate as content editors and has facilitated the production of over four million articles (http://en.wikipedia.org) of surprisingly high quality [6] in English alone since its debut in 2001.
Existing studies of collaborative knowledge systems have shown, however, that the quality of the generated content (e.g. an encyclopedia article) is highly correlated with the effectiveness of the online collaboration [12, 14]; fruitful collaboration, in turn, inevitably requires dealing with the disputes and conflicts that arise [13]. Unfortunately, human monitoring of the often massive social media and collaboration sites to detect, much less mediate, disputes is not feasible.
In this work, we investigate the heretofore unstudied task of dispute detection in online discussions. Previous work in this general area has analyzed dispute-laden content to discover features correlated with conflicts and disputes [13]. That research focused primarily on cues derived from the edit history of the jointly created content (e.g. the number of revisions and their temporal density [13, 23]) and relied on small numbers of manually selected discussions known to involve disputes. In contrast, we investigate methods for the automatic detection, i.e. prediction, of discussions involving disputes. We are also interested in understanding whether, and which, linguistic features of the discussion are important for dispute detection.
Drawing inspiration from studies of human mediation of online conflicts (e.g. Billings and Watts (2010), Kittur et al. (2007), Kraut and Resnick (2012)), we hypothesize that effective methods for dispute detection should take into account the sentiment and opinions expressed by participants in the collaborative endeavor. As a result, we propose a sentiment analysis approach for online dispute detection that identifies the sequence of sentence-level sentiments (i.e. very negative, negative, neutral, positive, very positive) expressed during the discussion and uses them as features in a classifier that predicts the dispute/non-dispute label for the discussion as a whole. Consider, for example, the snippet in Figure 1 from the Wikipedia Talk page for the article on Philadelphia; it discusses the choice of a picture for the article’s “infobox”. The sequence of almost exclusively negative statements provides evidence of a dispute in this portion of the discussion.
Figure 1: Snippet from the Wikipedia Talk page for the article on Philadelphia.

1-Emy111: I think everyone is forgetting that my previous image was the lead image for well over a year! …
Massimo: I’m sorry to say so, but it is grossly over processed…
2-Emy111: i’m glad you paid more money for a camera than I did. congrats… i appreciate your constructive criticism. thank you.
Massimo: I just want to have the best picture as a lead for the article…
3-Emy111: Wow, I am really enjoying this photography debate… [so don’t make assumptions you know nothing about.] [Really, grow up.] [If you all want to complain about Photoshop editing, lets all go buy medium format film cameras, shoot film, and scan it, so no manipulation is possible.] [Sound good?]
Massimo: … I do feel it is a pity, that you turned out to be a sore loser…
Unfortunately, sentence-level sentiment tagging for this domain is challenging in its own right due to the less formal, often ungrammatical language and the dynamic nature of online conversations. “Really, grow up” (segment 3) should presumably be tagged as a negative sentence, as should the sarcastic sentences “Sound good?” (in the same turn) and “congrats” and “thank you” (in segment 2). We expect that these, and other, examples will be difficult for a sentence-level classifier unless the discourse context of each sentence is considered. Previous research on sentiment prediction for online discussions, however, focuses on turn-level predictions [7, 24]. (A notable exception is Hassan et al. (2010), which identifies sentences containing “attitudes” (e.g. opinions) but does not distinguish them with respect to sentiment and does not consider context information.) As the first work to predict sentence-level sentiment for online discussions, we investigate isotonic Conditional Random Fields (CRFs) [16] for the sentiment-tagging task, as they preserve the advantages of the popular CRF-based sequential tagging models [15] while providing an efficient mechanism for encoding domain knowledge (in our case, a sentiment lexicon) through isotonic constraints on model parameters.
We evaluate our dispute detection approach using a newly created corpus of discussions from Wikipedia Talk pages (3609 disputes, 3609 non-disputes). (The talk page associated with each article records conversations among editors about the article content and allows editors to discuss the writing process, e.g. planning and organizing the content.) We find that classifiers that employ the learned sentiment features outperform others that do not. The best model achieves a very promising F1 score of 0.78 and an accuracy of 0.80 on the Wikipedia dispute corpus. To the best of our knowledge, this represents the first computational approach to automatically identify online disputes on a dataset of this scale.
Sentiment analysis has been utilized as a key enabling technique in a number of conversation-based applications. Previous work mainly studies attitudes in spoken meetings [5, 7] or broadcast conversations [21] using variants of Conditional Random Fields [15] and predicts sentiment at the turn level, whereas our predictions are made for each sentence.
We construct the first dispute detection corpus to date; it consists of dispute and non-dispute discussions from Wikipedia Talk pages.
Step 1: Get Talk Pages of Disputed Articles. Wikipedia articles are edited by different editors. If an article is observed to have disputes on its talk page, editors can assign dispute tags to the article to flag it for attention. In this research, we are interested in talk pages whose corresponding articles are labeled with the following tags: disputed, totallydisputed, disputed-section, totallydisputed-section, pov. These tags indicate that an article is disputed or that its neutrality is disputed (pov).
We use the 2013-03-04 Wikipedia data dump, and extract talk pages for articles that are labeled with dispute tags by checking the revision history. This results in 19,071 talk pages.
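For illustration, a minimal sketch of the tag check in Step 1 might look as follows, assuming each revision's wikitext is available as a string (the helper name and data layout are assumptions, not the code used in this work):

```python
import re

# Dispute-related templates listed above; in wikitext they appear as
# {{disputed}}, {{POV|date=...}}, etc.
DISPUTE_TAGS = ["disputed", "totallydisputed", "disputed-section",
                "totallydisputed-section", "pov"]
TAG_RE = re.compile(
    r"\{\{\s*(" + "|".join(re.escape(t) for t in DISPUTE_TAGS) + r")\s*(\||\}\})",
    re.IGNORECASE)

def has_dispute_tag(wikitext):
    """Return True if the revision text carries one of the dispute templates."""
    return TAG_RE.search(wikitext) is not None
```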
Step 2: Get Discussions with Disputes. Dispute tags can also be added to talk pages themselves. Therefore, in addition to the tags mentioned above, we also consider the “Request for Comment” (rfc) tag on talk pages. According to Wikipedia (http://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment), rfc is used to request outside opinions concerning the disputes.
In total, 3609 discussions are collected with dispute tags found in the revision history. We further classify the dispute discussions into three subcategories, Controversy, Request for Comment (RFC), and Resolved, based on the tags found in the discussions (see Table 1). The numbers of discussions for the three types are 42, 3484, and 105, respectively. Note that dispute tags appear in only a small number of articles and talk pages, so other discussions may also contain disputes.
| Dispute Subcategory | Wikipedia Tags on Talk Pages |
|---|---|
| Controversy | Controversial, totallydisputed, Disputed, Calm talk, POV |
| Request for Comment | rfc |
| Resolved | Any tag from above + Resolved |

Table 1: Wikipedia tags used to classify dispute discussions into subcategories.
Step 3: Get Discussions without Disputes. Likewise, we collect non-dispute discussions from talk pages that have never been tagged with disputes. We consider only non-dispute discussions with at least 3 distinct speakers and at least 10 turns, and randomly select 3609 discussions that meet this criterion. The average number of turns is 45.03 for dispute discussions and 22.95 for non-dispute discussions.
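A minimal sketch of the selection criterion in Step 3, assuming each discussion is already segmented into turns with speaker names (the data structure is an assumption made for illustration):

```python
def is_candidate_non_dispute(turns, min_speakers=3, min_turns=10):
    """turns: list of (speaker, text) pairs for one never-disputed discussion."""
    speakers = {speaker for speaker, _ in turns}
    return len(speakers) >= min_speakers and len(turns) >= min_turns
```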
This section describes our sentence-level sentiment tagger, from which we construct features for dispute detection (Section 4).
Consider a discussion comprised of sequential turns, where each turn consists of a sequence of sentences. Our model takes as input the sentences $\mathbf{x} = (x_1, \dots, x_n)$ from a single turn and outputs the corresponding sequence of sentiment labels $\mathbf{y} = (y_1, \dots, y_n)$, where $y_i \in \{NN, N, O, P, PP\}$. The labels represent very negative (NN), negative (N), neutral (O), positive (P), and very positive (PP), respectively.
Given that traditional Conditional Random Fields (CRFs) [15] ignore the ordinal relations among sentiment labels, we choose isotonic CRFs [16] for sentence-level sentiment analysis, as they can enforce monotonicity constraints on the parameters consistent with the ordinal structure and domain knowledge (e.g. word-level sentiment conveyed via a lexicon). Concretely, we take a lexicon $\mathcal{M} = \mathcal{M}_p \cup \mathcal{M}_n$, where $\mathcal{M}_p$ and $\mathcal{M}_n$ are two sets of features (usually words) identified as strongly associated with positive and negative sentiment, respectively. Let $\mu_{\langle \sigma, w \rangle}$ encode the weight between label $\sigma$ and feature $w$; then, for each feature $w \in \mathcal{M}_p$, the isotonic CRF enforces $\sigma \leq \sigma' \Rightarrow \mu_{\langle \sigma, w \rangle} \leq \mu_{\langle \sigma', w \rangle}$. For example, when “totally agree” is observed in training, the parameter $\mu_{\langle PP,\ \text{totally agree} \rangle}$ is likely to increase. Similar (reversed) constraints are defined on $\mathcal{M}_n$.
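To make the monotonicity idea concrete, the sketch below projects lexicon-feature weights back onto the ordered-label constraints using pool-adjacent-violators. This is our illustration of the constraint, not the training procedure of [16]; `mu` is assumed to be a plain dictionary keyed by (label, feature).

```python
LABELS = ["NN", "N", "O", "P", "PP"]  # ordered from very negative to very positive

def pav_nondecreasing(values):
    """Pool-adjacent-violators: nearest non-decreasing sequence (unit weights)."""
    blocks = [[v] for v in values]
    i = 0
    while i < len(blocks) - 1:
        left, right = blocks[i], blocks[i + 1]
        if sum(left) / len(left) > sum(right) / len(right):
            blocks[i:i + 2] = [left + right]   # merge violating neighbors
            i = max(i - 1, 0)
        else:
            i += 1
    fitted = []
    for block in blocks:
        fitted.extend([sum(block) / len(block)] * len(block))
    return fitted

def project_lexicon_weights(mu, positive_words):
    """Make mu[(label, w)] non-decreasing from NN to PP for every w in M_p.
    (For words in M_n, the reverse, non-increasing ordering would be enforced.)"""
    for w in positive_words:
        fitted = pav_nondecreasing([mu[(lab, w)] for lab in LABELS])
        for lab, v in zip(LABELS, fitted):
            mu[(lab, w)] = v
    return mu
```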
Our lexicon is built by combining MPQA [22], General Inquirer [19], and SentiWordNet [3] lexicons. Words with contradictory sentiments are removed. We use the features in Table 2 for sentiment prediction.
Table 2: Features used for sentence-level sentiment prediction.

Lexical Features
- unigram/bigram
- number of words all uppercased
- number of words

Discourse Features
- initial uni-/bi-/tri-gram
- repeated punctuations
- hedging phrases collected from Farkas et al. (2010)
- number of negators
- sentiment words

Syntactic/Semantic Features
- unigram with POS tag
- dependency relation

Conversation Features
- quote overlap with target
- TFIDF similarity with target (remove quote first)

Sentiment Features
- connective + sentiment words
- sentiment dependency relation
Syntactic/Semantic Features. We use two versions of dependency relation features: the original form and generalized forms in which one word is replaced by its POS tag, e.g. “nsubj(wrong, you)” is generalized to “nsubj(ADJ, you)” and “nsubj(wrong, PRP)”.
Discourse Features. We extract the initial unigram, bigram, and trigram of each utterance as discourse features [9].
Sentiment Features. We gather connectives from the Penn Discourse TreeBank [17] and combine each connective with any sentiment word that precedes or follows it to form new features. Sentiment dependency relations are dependency relations that include a sentiment word; we replace those words with their polarity equivalents, so that, for example, the relation “nsubj(wrong, you)” becomes “nsubj(SentiWord, you)”.
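The two generalized forms of a dependency relation, the POS backoff described under Syntactic/Semantic Features and the polarity substitution just described, can be produced along the following lines. This is an illustrative sketch; the function name, the POS lookup, and the lexicon argument are assumptions standing in for the resources described above.

```python
def dependency_features(rel, head, dep, pos, lexicon):
    """rel: relation name, e.g. "nsubj"; head/dep: the two words;
    pos: dict mapping word -> POS tag; lexicon: set of sentiment words.
    Returns the original relation plus its generalized variants."""
    feats = ["%s(%s, %s)" % (rel, head, dep)]
    # POS-tag backoff: generalize one word at a time.
    feats.append("%s(%s, %s)" % (rel, pos.get(head, head), dep))
    feats.append("%s(%s, %s)" % (rel, head, pos.get(dep, dep)))
    # Sentiment backoff: replace sentiment-bearing words with a placeholder.
    if head in lexicon or dep in lexicon:
        feats.append("%s(%s, %s)" % (rel,
                                     "SentiWord" if head in lexicon else head,
                                     "SentiWord" if dep in lexicon else dep))
    return feats
```

For "nsubj(wrong, you)" with "wrong" in the lexicon and POS tags {"wrong": "ADJ", "you": "PRP"}, this yields the original relation, "nsubj(ADJ, you)", "nsubj(wrong, PRP)", and "nsubj(SentiWord, you)", matching the examples above.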
Dataset. We train the sentiment classifier on a 5-point scale (i.e. NN, N, O, P, PP) using the Authority and Alignment in Wikipedia Discussions (AAWD) corpus [1]. AAWD consists of 221 English Wikipedia discussions with positive and negative alignment annotations. Annotators either label each sentence as positive, negative, or neutral, or label the full turn. For instances that have only a turn-level label, we assume all of the turn's sentences share that label. We then map these annotations onto the five sentiment labels: sentences annotated as a positive alignment by at least two annotators are treated as very positive (PP); a sentence selected as positive by only one annotator, or one that obtains its label via turn-level annotation, is positive (P); very negative (NN) and negative (N) are derived in the same way; all others are neutral (O). Among all 16,501 sentences in AAWD, 1,930 and 1,102 are labeled as NN and N, 532 and 99 are PP and P, and the other 12,648 are considered neutral.
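A sketch of the label-mapping rules just described, assuming each sentence comes with counts of positive/negative annotators and an optional inherited turn-level label (this bookkeeping format is an assumption, not the AAWD release format):

```python
def aawd_label(n_pos, n_neg, turn_label=None):
    """Map AAWD alignment annotations for one sentence onto the 5-point scale.
    n_pos / n_neg: number of annotators marking positive / negative alignment;
    turn_label: 'positive' or 'negative' if the sentence only inherits a
    turn-level annotation."""
    if n_pos >= 2:
        return "PP"
    if n_neg >= 2:
        return "NN"
    if n_pos == 1 or turn_label == "positive":
        return "P"
    if n_neg == 1 or turn_label == "negative":
        return "N"
    return "O"
```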
Evaluation. To evaluate the performance of the sentiment tagger, we compare against two baselines. (1) Baseline (Polarity): a sentence is predicted as positive if it contains more positive words than negative words, negative if it contains more negative words than positive words, and neutral otherwise. (2) Baseline (Distance), extended from [8]: each sentiment word is associated with the closest second-person pronoun, and a surface distance between them is computed; an SVM classifier [10] is then trained using features of the sentiment words and the minimum/maximum/average of these distances.
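A minimal sketch of Baseline (Polarity), with the two word sets standing in for the positive and negative entries of the combined lexicon described above:

```python
def polarity_baseline(tokens, pos_words, neg_words):
    """Predict a 3-way sentiment label for one sentence by counting lexicon hits."""
    n_pos = sum(1 for t in tokens if t.lower() in pos_words)
    n_neg = sum(1 for t in tokens if t.lower() in neg_words)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "neutral"
```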
We also compare with two state-of-the-art methods that have been used for sentiment prediction in conversations: (1) an SVM (RBF kernel), which has been employed to identify sentiment-bearing sentences [8] and to detect (dis)agreement in online debates [24]; (2) a linear CRF for (dis)agreement identification in broadcast conversations [21].
| | Pos | Neg | Neutral |
|---|---|---|---|
| Baseline (Polarity) | 22.53 | 38.61 | 66.45 |
| Baseline (Distance) | 33.75 | 55.79 | 88.97 |
| SVM (3-way) | 44.62 | 52.56 | 80.84 |
| CRF (3-way) | 56.28 | 56.37 | 89.41 |
| CRF (5-way) | 58.39 | 56.30 | 90.10 |
| isotonic CRF | 68.18 | 62.53 | 88.87 |

Table 3: F1 scores (%) for positive, negative, and neutral alignment on the AAWD dataset.
We evaluate the systems using standard F1 on the positive, negative, and neutral classes, where samples predicted as PP and P count as positive alignment and samples tagged as NN and N count as negative alignment. Table 3 shows the main results on the AAWD dataset: our isotonic CRF based system significantly outperforms the alternatives for positive and negative alignment detection (paired-$t$ test, $p < 0.05$).
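In code, the collapse from 5-way predictions to the 3-way evaluation classes is simply the following (a bookkeeping sketch, not the authors' evaluation script):

```python
def collapse(label):
    """Map a 5-point sentiment label to the 3-way alignment class."""
    if label in ("PP", "P"):
        return "positive"
    if label in ("NN", "N"):
        return "negative"
    return "neutral"
```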
Sample sentences from two dispute discussions (sentiment labels in parentheses):

Discussion 1 (unresolved dispute):
A: no, I sincerely plead with you… (N) If not, you are just wasting my time. (NN)
B: I believe Sweet’s proposal… is quite silly. (NN)
C: Tell you what. (NN) If you can get two other editors to agree… I will shut up and sit down. (NN)
D: But some idiot forging your signature claimed that doing so would violate. (NN)… Please go have some morning coffee. (O)
E: And I don’t like coffee. (NN) Good luck to you. (NN)
F: Was that all? (NN)… I think that you are in error… (N)

Discussion 2 (resolved dispute):
A: So far so confusing. (NN)…
B: … I can not see a rationale for the landrace having its own article… (N) With Turkish Van being a miserable stub, there’s no such rationale for forking off a new article… (NN)…
C: I’ve also copied your post immediately above to that article’s talk page since it is a great “nutshell” summary. (PP)
D: Err.. how can the opposite be true… (N)
E: Thanks for this, though I have to say some of the facts floating around this discussion are wrong. (P)
F: Great. (PP) Let’s make sure the article is clear on this. (O)
We model dispute detection as a standard binary classification task, and investigate four major types of features as described below.
Lexical Features. We first collect unigram and bigram features for each discussion.
Topic Features. Articles on certain topics, such as politics or religion, tend to arouse more disputes. We therefore extract the category information of the corresponding article for each talk page and use unigrams and bigrams of the category names as topic features.
Discussion Features. This type of feature aims to capture the structure of the discussion. Intuitively, the more turns or participants a discussion has, the more likely it is to involve a dispute. Meanwhile, participants tend to produce longer utterances when they make arguments. We therefore use the number of turns, the number of participants, and the average number of words per turn as features. In addition, the frequency of revisions made during a discussion has been shown to be a good indicator of controversial articles [20], which are presumably prone to disputes, so we also encode the number of revisions made during the discussion as a feature.
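A short sketch of these structural features, assuming the same (speaker, text) turn layout as above and a precomputed revision count (both assumptions for illustration):

```python
def discussion_features(turns, n_revisions):
    """turns: list of (participant, text) pairs; n_revisions: number of article
    revisions made while the discussion was active."""
    n_turns = len(turns)
    participants = {p for p, _ in turns}
    avg_words = sum(len(text.split()) for _, text in turns) / max(n_turns, 1)
    return {"num_turns": n_turns,
            "num_participants": len(participants),
            "avg_words_per_turn": avg_words,
            "num_revisions": n_revisions}
```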
Sentiment Features. This set of features encodes the sentiment distribution and transitions in the discussion. We train our sentiment tagging model on the full AAWD dataset and run it on the Wikipedia dispute corpus.
Given that a consistently negative sentiment flow usually indicates an ongoing dispute, we first extract features from the sentiment distribution in the form of the number/probability of sentences per sentiment type. We also estimate sentiment transition probabilities $p(y_i \rightarrow y_{i+1})$ from our predictions, where $y_i$ and $y_{i+1}$ are the sentiment labels of the current sentence and the next, and use the number/proportion of sentiment transitions per type as features.
The features described above mostly depict the global sentiment flow of a discussion. We also construct local versions of them, since the sentiment distribution may change as the discussion proceeds; for example, less positive sentiment may be observed as a dispute escalates. We thus split each discussion into three equal-length stages and create sentiment distribution and transition features for each stage.
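A minimal sketch of the global and stage-wise sentiment distribution/transition features described in the last two paragraphs (our illustration; the feature names are hypothetical):

```python
from collections import Counter

LABELS = ["NN", "N", "O", "P", "PP"]

def sentiment_features(labels, n_stages=3):
    """labels: predicted sentiment label sequence for one discussion, in order.
    Returns global distribution/transition proportions plus the same features
    computed over three equal-length stages of the discussion."""
    feats = {}

    def add(prefix, seq):
        n = max(len(seq), 1)
        counts = Counter(seq)
        for lab in LABELS:
            feats["%s_dist_%s" % (prefix, lab)] = counts[lab] / n
        trans = Counter(zip(seq, seq[1:]))
        n_trans = max(len(seq) - 1, 1)
        for a in LABELS:
            for b in LABELS:
                feats["%s_trans_%s_%s" % (prefix, a, b)] = trans[(a, b)] / n_trans

    add("global", labels)
    # Local version: split the discussion into three equal-length stages.
    k = max(len(labels) // n_stages, 1)
    for i in range(n_stages):
        start = i * k
        end = (i + 1) * k if i < n_stages - 1 else len(labels)
        add("stage%d" % (i + 1), labels[start:end])
    return feats
```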
| | Prec | Rec | F1 | Acc |
|---|---|---|---|---|
| Baseline (Random) | 50.00 | 50.00 | 50.00 | 50.00 |
| Baseline (All dispute) | 50.00 | 100.00 | 66.67 | 50.00 |
| Logistic Regression | 74.76 | 72.29 | 73.50 | 73.94 |
| SVM (Linear) | 69.81 | 71.90 | 70.84 | 70.41 |
| SVM (RBF) | 77.38 | 79.14 | 78.25 | 80.00 |

Table 4: Dispute detection results (%) for different classifiers using all features.
| Features | Prec | Rec | F1 | Acc |
|---|---|---|---|---|
| Lexical (Lex) | 75.86 | 34.66 | 47.58 | 61.82 |
| Topic (Top) | 68.44 | 71.46 | 69.92 | 69.26 |
| Discussion (Dis) | 69.73 | 76.14 | 72.79 | 71.54 |
| Sentiment (Senti) | 72.54 | 69.52 | 71.00 | 71.60 |
| Top + Dis | 68.49 | 71.79 | 70.10 | 69.38 |
| Top + Dis + Senti (global) | 77.39 | 78.36 | 77.87 | 77.74 |
| Top + Dis + Senti (global + local) | 77.38 | 79.14 | 78.25 | 80.00 |
| Lex + Top + Dis + Senti | 78.38 | 75.12 | 76.71 | 77.20 |

Table 5: Dispute detection results (%) with different feature combinations.
Results and Error Analysis. We experiment with logistic regression and SVMs with linear and RBF kernels, all of which are effective methods for text categorization tasks [10, 25]. We normalize the features by standardization and conduct 5-fold cross-validation. Two baselines are included: (1) labels are randomly assigned; (2) all discussions are labeled as disputes.
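A minimal sketch of this setup, assuming scikit-learn (the paper does not name a toolkit): standardize the features, then run 5-fold cross-validation with an RBF-kernel SVM.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate(X, y):
    """X: discussions x features matrix; y: 1 = dispute, 0 = non-dispute."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    return scores.mean()
```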
Main results for the different classifiers are displayed in Table 4. All learning-based methods outperform the two baselines, and among them, the SVM with the RBF kernel achieves the best F1 score and accuracy (0.78 and 0.80). Experimental results with various combinations of feature sets are displayed in Table 5. As can be seen, sentiment features obtain the best accuracy among the four feature types. A combination of topic, discussion, and sentiment features achieves the best recall, F1, and accuracy; in particular, its accuracy is significantly higher than that of all other systems (paired-$t$ test, $p < 0.05$).
A closer look at the results reveals two main sources of incorrect predictions. First, sentiment prediction errors propagate into dispute detection. Due to the limitations of existing general-purpose lexicons, some opinionated, dialog-specific expressions are hard to catch; for example, “I told you over and over again…” strongly suggests negative sentiment, yet no single word carries a negative connotation. Constructing a lexicon tuned to conversational text may improve performance. Second, some dispute discussions are harder to detect than others because of their dialog structures: the recall for dispute discussions tagged “controversy”, “RFC”, and “resolved” is 0.78, 0.79, and 0.86, respectively. We intend to design models that can capture dialog structure in future work.
Sentiment Flow Visualization. We visualize the sentiment flow of two dispute discussions in Figure 2. The plots reveal persistent negative sentiment in the unresolved dispute (top); in the resolved dispute (bottom), participants express gratitude once the problem is settled.
We present a sentiment analysis-based approach to online dispute detection and create a large-scale dispute corpus from Wikipedia Talk pages to study the problem. We propose a sentiment prediction model based on isotonic CRFs that outputs sentiment labels at the sentence level. Experiments on our dispute corpus demonstrate that classifiers trained with the sentiment tagging features outperform those that do not use them.
Acknowledgments We heartily thank the Cornell NLP Group, the reviewers, and Yiye Ruan for helpful comments. We also thank Emily Bender and Mari Ostendorf for providing the AAWD dataset. This work was supported in part by NSF grants IIS-0968450 and IIS-1314778, and DARPA DEFT Grant FA8750-13-2-0015. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, DARPA or the U.S. Government.