Automatically detecting verbal irony (roughly, sarcasm) is a challenging task because ironists say something other than – and often opposite to – what they actually mean. Discerning ironic intent exclusively from the words and syntax comprising texts (e.g., tweets, forum posts) is therefore not always possible: additional contextual information about the speaker and/or the topic at hand is often necessary. We introduce a new corpus that provides empirical evidence for this claim. We show that annotators frequently require context to make judgements concerning ironic intent, and that machine learning approaches tend to misclassify those same comments for which annotators required additional context.
This work concerns the task of detecting verbal irony online. Our principal argument is that simple bag-of-words based text classification models – which, when coupled with sufficient data, have proven to be extremely successful for many natural language processing tasks [] – are inadequate for irony detection. In this paper we provide empirical evidence that context is often necessary to recognize ironic intent.
This is consistent with the large body of pragmatics/linguistics literature on irony and its usage, which has emphasized the role that context plays in recognizing and decoding ironic utterances []. But existing work on automatic irony detection – reviewed in Section 2 – has not explicitly attempted to operationalize such theories, and has instead relied on features (mostly word counts) intrinsic to the texts being classified. These approaches have achieved some success, but they necessarily face an upper bound: the exact same sentence can be intended either ironically or unironically, depending on the context (including the speaker and the topic at hand). Only the most obvious verbal ironies will be recognizable from intrinsic features alone.
Here we provide empirical evidence for the above claims. We also introduce a new annotated corpus that will allow researchers to build models that augment existing approaches to irony detection with contextual information regarding the text (utterance) to be classified and its author. Briefly, our contributions are summarized as follows.
We introduce the first version of the reddit irony corpus, composed of annotated comments from the social news website reddit. Each sentence in every comment in this corpus has been labeled by three independent annotators as having been intended by the author ironically or not. This dataset is publicly available at https://github.com/bwallace/ACL-2014-irony.
We provide empirical evidence that human annotators consistently rely on contextual information to make ironic/unironic sentence judgements.
We show that the standard ‘bag-of-words’ approach to text classification fails to accurately judge ironic intent on those cases for which humans required additional context. This suggests that, as humans require context to make their judgements for this task, so too do computers.
Our hope is that these observations and this dataset will spur innovative new research on methods for verbal irony detection.
There has recently been a flurry of interesting work on automatic irony detection []. In these works, verbal irony detection has mostly been treated as a standard text classification task, though with some innovative approaches specific to detecting irony.
The most common data source used to experiment with irony detection systems has been Twitter [], though Amazon product reviews have been used experimentally as well []. Walker et al. [] also recently introduced the Internet Argument Corpus (IAC), which includes a sarcasm label (among others).
Some of the findings from these previous efforts have squared with intuition: e.g., overzealous punctuation (as in “great idea!!!!”) is indicative of ironic intent []. Other works have proposed novel approaches specifically for irony detection: Davidov et al. [], for example, proposed a semi-supervised approach in which they look for sentence templates indicative of irony. Elsewhere, Riloff et al. [] proposed a method that exploits contrasting sentiment in the same utterance to detect irony.
To our knowledge, however, no previous work on irony detection has attempted to leverage contextual information regarding the author or speaker (external to the utterance), even though such information is necessary in some cases. For example, in the case of Amazon product reviews, knowing the kinds of books that an individual typically likes might inform our judgement: someone who tends to read and review Dostoevsky is probably being ironic if she writes a glowing review of Twilight. Of course, many people genuinely do enjoy Twilight, and so if the review is written subtly it will likely be difficult to discern the author's intent without this background. In the case of Twitter, it is likely to be difficult to classify utterances without considering the contextualizing exchange of tweets (i.e., the conversation) to which they belong.
Table 1: The six sub-reddits from which comments were collected.

| sub-reddit (URL) | description | labeled comments |
|---|---|---|
| politics (r/politics) | Political news and editorials; focus on the US. | 873 |
| conservative (r/conservative) | A community for political conservatives. | 573 |
| progressive (r/progressive) | A community for political progressives (liberals). | 543 |
| atheism (r/atheism) | A community for non-believers. | 442 |
| Christianity (r/Christianity) | News and viewpoints on the Christian faith. | 312 |
| technology (r/technology) | Technology news and commentary. | 277 |
Here we introduce the first version (1.0) of our irony corpus. Reddit (http://reddit.com) is a social news website to which news stories (and other links) are posted, voted on and commented upon. The forum component of reddit is extremely active: popular posts often attract thousands of user comments. Reddit comprises 'sub-reddits', which focus on specific topics; for example, http://reddit.com/r/politics features articles (and hence comments) centered around political news. The current version of the corpus is available at: https://github.com/bwallace/ACL-2014-irony. Data collection and annotation is ongoing, so we will continue to release new (larger) versions of the corpus in the future. The present version comprises 3,020 annotated comments scraped from the six sub-reddits enumerated in Table 1. These comments in turn comprise a total of 10,401 labeled sentences (we performed naïve segmentation of comments into sentences based on punctuation).
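To make the segmentation step concrete, the following is a minimal Python sketch of punctuation-based sentence splitting; the exact rules used to build the corpus may differ.

```python
import re

def segment(comment):
    """Naively split a comment into 'sentences' at terminal punctuation.

    Illustrative only; the corpus may have used slightly different rules.
    """
    # Split after runs of '.', '!' or '?' that are followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", comment.strip())
    return [p for p in parts if p]

print(segment("Great idea on the talkathon Cruz. Really made the "
              "republicans look like the sane ones."))
# ['Great idea on the talkathon Cruz.', 'Really made the republicans look like the sane ones.']
```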
Three university undergraduates independently annotated each sentence in the corpus. More specifically, annotators provided a binary label for each sentence indicating whether they (the annotator) believed it was intended ironically by its author. These annotations were provided via a custom-built browser-based annotation tool, shown in Figure 1.
We intentionally did not provide much guidance to annotators regarding the criteria for what constitutes an ‘ironic’ statement, for two reasons. First, verbal irony is a notoriously slippery concept [] and coming up with an operational definition to be consistently applied is non-trivial. Second, we were interested in assessing the extent of natural agreement between annotators for this task. The raw average agreement between all annotators on all sentences is 0.844. Average pairwise Cohen’s Kappa [] is 0.341, suggesting fair to moderate agreement [], as we might expect for a subjective task like this one.
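For concreteness, the agreement statistics above can be computed along the following lines (a sketch using scikit-learn; the label arrays shown are hypothetical stand-ins for the three annotators' sentence labels).

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# One binary irony judgement per sentence, per annotator (hypothetical values).
labels = {
    "annotator_1": np.array([0, 1, 0, 0, 1, 0]),
    "annotator_2": np.array([0, 1, 0, 1, 1, 0]),
    "annotator_3": np.array([0, 0, 0, 0, 1, 0]),
}

pairs = list(combinations(labels, 2))
raw = np.mean([np.mean(labels[a] == labels[b]) for a, b in pairs])
kappa = np.mean([cohen_kappa_score(labels[a], labels[b]) for a, b in pairs])
print(f"raw agreement: {raw:.3f}; mean pairwise kappa: {kappa:.3f}")
```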
Reddit is a good corpus for the irony detection task in part because it provides a natural practical realization of the otherwise ill-defined context for comments. In particular, each comment is associated with a specific user (the author), and we can view their previous comments. Moreover, comments are embedded within discussion threads that pertain to the (usually external) content linked to in the corresponding submission (see Figure 2). These pieces of information (previous comments by the same user, the external link of the embedding reddit thread, and the other comments in this thread) constitute our context. All of this is readily accessible. Labelers can opt to request these pieces of context via the annotation tool, and we record when they do so.
Consider the following example comment taken from our dataset: “Great idea on the talkathon Cruz. Really made the republicans look like the sane ones.” Did the author intend this statement literally, or as a subtle dig at Senator Ted Cruz? Without additional context it is difficult to know, and indeed all three annotators requested additional context for this comment. This context at first suggests that the comment may have been intended literally: it was posted in the r/conservative sub-reddit (Ted Cruz is a conservative senator). But if we peruse the author’s comment history, we see that he or she repeatedly derides Senator Cruz (e.g., writing “Ted Cruz is no Ronald Reagan. They aren’t even close.”). From this contextual information, then, we can reasonably infer that the comment was intended ironically (and all three annotators did so after assessing the available contextual information).
We explore the extent to which human annotators rely on contextual information to decide whether or not sentences were intended ironically. Recall that our annotation tool allows labelers to request additional context if they cannot make a decision based on the comment text alone (Figure 1). On average, annotators requested additional context for 30% of comments (range across annotators of 12% to 56%). As shown in Figure 3, annotators are consistently more confident once they have consulted this information.
We tested for a correlation between these requests for context and the final decisions regarding whether comments contain at least one ironic sentence. We denote the probability of at least one annotator requesting additional context for comment i by p_i. We then model (the logit of) this probability as a linear function of whether or not any annotator labeled any sentence in comment i as ironic. We code the latter via the indicator variable ironic_i, which is 1 when comment i has been deemed to contain an ironic sentence (by any of the three annotators) and 0 otherwise.
$$\text{logit}(p_i) = \beta_0 + \beta_1 \cdot \text{ironic}_i \qquad (1)$$
We used the regression model shown in Equation 1, where β0 is an intercept and β1 captures the correlation between requests for context for a given comment and that comment ultimately being deemed to contain at least one ironic sentence. We fit this model to the annotated corpus and found a significant positive association: the 95% confidence interval for β1 was (1.326, 1.690), well away from zero.
In other words, annotators request context significantly more frequently for those comments that (are ultimately deemed to) contain an ironic sentence. This would suggest that the words and punctuation comprising online comments alone are not sufficient to distinguish ironic from unironic comments. Despite this, most machine learning based approaches to irony detection have relied nearly exclusively on such intrinsic features.
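A model like Equation 1 can be fit with any off-the-shelf logistic regression implementation. Below is a minimal sketch using statsmodels, assuming a per-comment table with binary columns context (any annotator requested context) and ironic (any annotator labeled a sentence ironic); the column names and values are hypothetical, not from the corpus.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-comment flags; in practice these are derived from the
# three annotators' labels and their recorded context requests.
df = pd.DataFrame({
    "context": [1, 0, 1, 1, 0, 0, 1, 0],
    "ironic":  [1, 0, 1, 0, 0, 1, 1, 0],
})

model = smf.logit("context ~ ironic", data=df).fit()
print(model.params)      # beta_0 (Intercept) and beta_1 (ironic)
print(model.conf_int())  # 95% confidence intervals
```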
We show that the misclassifications (with respect to whether comments contain irony or not) made by a standard text classification model significantly correlate with those comments for which human annotators requested additional context. This provides evidence that bag-of-words approaches are insufficient for the general task of irony detection: more context is necessary.
We implemented a baseline classification approach using vanilla token count features (binary bag-of-words). We removed stop-words and limited the vocabulary to the 50,000 most frequently occurring unigrams and bigrams. We added binary features coding for the presence of punctuation cues, such as exclamation points, emoticons (for example, ‘;)’) and question marks: previous work [] has found that these are good indicators of ironic intent.
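A minimal sketch of this feature set using scikit-learn follows; the particular punctuation cues listed are illustrative.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

comments = ["great idea !!!!", "Ted Cruz is no Ronald Reagan."]

# Binary unigram/bigram indicators, stop-words removed, vocabulary capped.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english",
                             max_features=50000, binary=True)
X_words = vectorizer.fit_transform(comments)

# Extra binary features flagging punctuation cues (illustrative list).
cues = ["!", "?", ";)", ":)"]
X_punct = csr_matrix([[int(cue in c) for cue in cues] for c in comments])

X = hstack([X_words, X_punct]).tocsr()  # final feature matrix
```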
For our predictive model, we used a linear-kernel SVM, tuning the regularization parameter C via grid search over the training dataset to maximize F1 score. We performed five-fold cross-validation, recording the prediction for each (held-out) comment i. The average F1 score over the five folds was 0.383 with range (0.330, 0.412); mean recall was 0.496 (0.446, 0.548) and average precision was 0.315 (0.261, 0.380). The five most predictive tokens were: !, yeah, guys, oh and shocked. This represents reasonable performance (with intuitive predictive tokens), but there is obviously quite a bit of room for improvement. (Some of the recently proposed strategies mentioned in Section 2 may improve performance here, but none of these address the fundamental issue of context.)
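The corresponding experimental loop might look as follows (a sketch; the grid of C values and the synthetic stand-ins for the feature matrix and labels are our own).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(100, 20)            # stand-in for the sparse feature matrix above
y = rng.randint(0, 2, size=100)  # hypothetical comment-level irony labels

# Inner grid search tunes C on the training folds to maximize F1.
svm = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]},
                   scoring="f1", cv=5)

# Outer five-fold cross-validation records a prediction for each held-out comment.
preds = cross_val_predict(svm, X, y, cv=5)
mistakes = (preds != y).astype(int)  # feeds into Equation 2 below
```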
We now explore empirically whether these misclassifications fall on the same comments for which annotators requested context. To this end, we introduce an indicator variable mistake_i for each comment i, which is 1 if the classifier misclassified comment i and 0 otherwise. We then ran a second regression in which the output variable was the logit-transformed probability q_i of the model misclassifying comment i. Here we are interested in the correlation between model misclassifications and the event that one or more annotators requested additional context for comment i (denoted by the indicator context_i), adjusting for the comment’s true label (ironic_i). Formally:
$$\text{logit}(q_i) = \beta_0 + \beta_1 \cdot \text{context}_i + \beta_2 \cdot \text{ironic}_i \qquad (2)$$
Fitting this model to the data, we estimated a significantly positive β1: its 95% confidence interval was (0.810, 1.133), which excludes zero. Put another way, the model tends to make mistakes on exactly those comments for which annotators requested additional context (even after accounting for the annotator designation of comments as ironic or not).
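As with Equation 1, this model can be fit with standard logistic regression; a sketch follows, again with hypothetical per-comment flags (in practice mistake comes from the cross-validated predictions above, and context and ironic from the annotations).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-comment indicators (illustrative values only).
df = pd.DataFrame({
    "mistake": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    "context": [1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
    "ironic":  [1, 0, 0, 0, 1, 1, 0, 1, 1, 1],
})

model = smf.logit("mistake ~ context + ironic", data=df).fit()
print(model.conf_int())  # interest centers on the coefficient for `context`
```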
We have described a new (publicly available) corpus for the task of verbal irony detection, comprising comments scraped from the social news website reddit. During annotation, we recorded confidence judgements and requests for contextualizing information for each comment. Analyzing this corpus, we provided empirical evidence that annotators quite often require context beyond the comment under consideration to discern irony, especially for those comments ultimately deemed to have been intended ironically. We also demonstrated that a standard token-based machine learning approach misclassified many of the same comments for which annotators tended to request context.
We have shown that annotators rely on contextual cues (in addition to word and grammatical features) to discern irony and argued that this implies computers should, too. The obvious next step is to develop new machine learning models that exploit the contextual information available in the corpus we have curated (e.g., previous comments by the same user, the thread topic).
This work was made possible by the Army Research Office (ARO), grant #64481-MA.