This paper describes a new approach to predicting the aspectual class of verbs in context, i.e., whether a verb is used in a stative or dynamic sense. We identify two challenging cases of this problem: when the verb is unseen in training data, and when the verb is ambiguous for aspectual class. A semi-supervised approach using linguistically-motivated features and a novel set of distributional features based on representative verb types allows us to predict classes accurately, even for unseen verbs. Many frequent verbs can be either stative or dynamic in different contexts, which has not been modeled by previous work; we use contextual features to resolve this ambiguity. In addition, we introduce two new datasets of clauses marked for aspectual class.
In this work, we focus on the automatic prediction of whether a verb in context is used in a stative or in a dynamic sense, the most fundamental distinction in all taxonomies of aspectual class. The aspectual class of a discourse’s finite verbs is an important factor in conveying and interpreting temporal structure [21, 10, 18]; others are tense, grammatical aspect, mood and whether the utterance represents an event as completed. More accurate temporal information processing is expected to be beneficial for a variety of natural language processing tasks [7, 32].
While most verbs have one predominant interpretation, others are more flexible for aspectual class and can occur as either stative (1a) or dynamic (1b) depending on the context. There are also cases that allow for both readings, such as (1c).
(1a) The liquid fills the container. (stative)
(1b) The pool slowly filled with water. (dynamic)
(1c) Your soul was made to be filled with God Himself. (both) (Brown corpus, religion)
Cases like (1c) do not imply that there is a third class, but rather that two interpretations are available for the sentence, of which usually one will be chosen by a reader.
Following Siegel and McKeown (2000), we aim to automatically classify clauses for fundamental aspectual class, a function of the main verb and a select group of complements, which may differ per verb [26, 28]. This corresponds to the aspectual class of the clause’s main verb when ignoring any aspectual markers or transformations. For example, English sentences with perfect tense are usually considered to introduce states to the discourse [29, 17], but we are interested in the aspectual class before this transformation takes place. The clause John has kissed Mary introduces a state, but the fundamental aspectual class of the ‘tenseless’ clause John kiss Mary is dynamic.
In contrast to Siegel and McKeown (2000), we do not conduct the task of predicting aspectual class solely at the type level, as such an approach ignores the minority class of ambiguous verbs. Instead we predict the aspectual class of verbs in the context of their arguments and modifiers. We show that this method works better than using only type-based features, especially for verbs with ambiguous aspectual class. In addition, we show that type-based features, including novel distributional features based on representative verbs, accurately predict predominant aspectual class for unseen verb types. Our work differs from prior work in that we treat the problem as a three-way classification task, predicting dynamic, stative or both as the aspectual class of a verb in context.
Aspectual class is well treated in the linguistic literature [33, 12, 29, for example]. Our notion of the stative/dynamic distinction corresponds to Bach’s [1] distinction between states and non-states; to states versus occurrences (events and processes) according to Mourelatos (1978); and to Vendler’s [33] distinction between states and the other three classes (activities, achievements, accomplishments).
Early studies on the computational modeling of aspectual class [23, 24, 5, 18] laid foundations for a cluster of papers published over a decade ago [26, 28, 27]. Since then, it has mostly been treated as a subtask within temporal reasoning, such as in efforts related to TimeBank [25] and the TempEval challenges [34, 35, 32], where top-performing systems [16, 3, 6] use corpus-based features, WordNet synsets, parse paths and features from typed dependencies to classify events as a joint task with determining the event’s span. Costa and Branco (2012) explore the usefulness of a wider range of explicitly aspectual features for temporal relation classification.
Siegel and McKeown (2000) present the most extensive study of predicting aspectual class, which is the main inspiration for this work. While all of their linguistically motivated features (see section 4.1) are type-based, they train on and evaluate over labeled verbs in context. Their data set, taken from medical discharge summaries, comprises 1500 clauses containing main verbs other than be and have which are marked for aspectual class. Their model fails to outperform a baseline of memorizing the most frequent class of a verb type, and they present an experiment testing on unseen verb types only for the related task of classifying completedness of events. We replicate their method using publicly available software, create a similar but larger corpus, and show that it is indeed possible to predict the aspectual class of unseen verbs. (Direct comparison on their data is not possible: feature values for the verbs studied are available, but full texts and the English Slot Grammar parser [20] are not.) Siegel (1998a) investigates a classification method for the verb have in context; inspired by this work, our present work goes one step further and uses a larger set of instance-based contextual features to perform experiments on a set of 20 verbs. To the best of our knowledge, there is no previous work comprehensively addressing aspectual classification of verbs in context.
genre | clauses (complete) | κ (complete) | clauses (w/o have/be/none) | κ (w/o have/be/none) |
---|---|---|---|---|
jokes | 3462 | 0.85 | 2660 | 0.77 |
letters | 1848 | 0.71 | 1444 | 0.62 |
news | 2565 | 0.79 | 2075 | 0.69 |
all | 7875 | 0.80 | 6161 | 0.70 |

class | dynamic | stative | both |
---|---|---|---|
dynamic | 4464 | 164 | 9 |
stative | 434 | 1056 | 29 |
both | 5 | 0 | 0 |
The Asp-MASC corpus consists of 7875 clauses from the letters, news and jokes sections of MASC [15], each labeled by two annotators for the aspectual class of the main verb. (The corpus is freely available from www.coli.uni-saarland.de/~afried.) Texts were segmented into clauses using SPADE [30] with some heuristic post-processing. We parse the corpus using the Stanford dependency parser [8] and extract the main verb of each segment. We use 6161 clauses for the classification task, omitting clauses with have or be as the main verb and those where no main verb could be identified due to parsing errors (none).
Table 1 shows inter-annotator agreement; Table 2 shows the confusion matrix for the two annotators. Our two annotators exhibit different preferences on the 598 cases where they disagree between dynamic and stative. Such differences in annotation preferences are not uncommon [2]. We observe higher agreement in the jokes and news subcorpora than in letters; texts in the letters subcorpus are largely argumentative and thus have a different rhetorical style than the more straightforward narratives and reports found in jokes. Overall, we find substantial agreement.
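The agreement figure for the 6161-clause subset can be re-derived from the annotator confusion matrix. The following is a minimal sketch that assumes the reported agreement is unweighted Cohen's κ; the function name and the assignment of rows and columns to the two annotators are illustrative and do not affect the result.

```python
def cohens_kappa(matrix):
    """Unweighted Cohen's kappa for a square annotator confusion matrix."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of the annotators' marginal distributions.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Confusion matrix from Table 2 (class order: dynamic, stative, both).
asp_masc = [[4464, 164, 9], [434, 1056, 29], [5, 0, 0]]
kappa = cohens_kappa(asp_masc)
disagreements = asp_masc[0][1] + asp_masc[1][0]  # dynamic/stative conflicts
```

Under this assumption, `kappa` rounds to 0.70, in line with the value listed for all genres without have/be/none, and `disagreements` recovers the 598 dynamic/stative conflicts mentioned in the text.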
The data for our experiments uses the label dynamic or stative whenever annotators agree, and both whenever they disagree or when at least one annotator marked the clause as both, assuming that both readings are possible in such cases. Because we don’t want to model the authors’ personal view of the theory, we refrain from applying an adjudication step and model the data as is.
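The labeling rule just described can be written down in a few lines. This is a sketch only: `gold_label` is a hypothetical helper name, and the counts in the usage part are derived by applying the rule to the Asp-MASC annotator confusion matrix (Table 2) rather than quoted from the paper.

```python
from collections import Counter


def gold_label(label_a, label_b):
    """Combine two annotations without adjudication.

    Agreement on dynamic or stative keeps that label; any disagreement,
    or any use of 'both' by either annotator, yields 'both'.
    """
    if label_a == label_b and label_a in ("dynamic", "stative"):
        return label_a
    return "both"


# Apply the rule to the cells of the annotator confusion matrix (Table 2).
classes = ["dynamic", "stative", "both"]
matrix = [[4464, 164, 9], [434, 1056, 29], [5, 0, 0]]
label_counts = Counter()
for i, a in enumerate(classes):
    for j, b in enumerate(classes):
        label_counts[gold_label(a, b)] += matrix[i][j]
```

Read this way, the 6161 clauses split into 4464 dynamic, 1056 stative and 641 both instances.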
class | dynamic | stative | both |
---|---|---|---|
dynamic | 1444 | 201 | 54 |
stative | 168 | 697 | 20 |
both | 44 | 31 | 8 |
In order to facilitate a first study on ambiguous verbs, we select 20 frequent verbs from the list of ‘mixed’ verbs (see section 3) and mark 138 sentences for each. Sentences are extracted randomly from the Brown corpus, such that the distribution of stative/dynamic usages is expected to be natural. We present entire sentences to the annotators, who mark the aspectual class of the verb in question as highlighted in the sentence. The data is processed in the same way as Asp-MASC, discarding instances with parsing problems. This results in 2667 instances. Cohen's κ is 0.6; the confusion matrix is shown in Table 3. Details are listed in Table 10.
For predicting the aspectual class of verbs in context (stative, dynamic, both), we assume a supervised learning setting and explore features mined from a large background corpus, distributional features, and instance-based features. If not indicated otherwise, experiments use a Random Forest classifier [4] trained with the implementation and standard parameter settings from Weka [14].
This set of corpus-based features is a reimplementation of the linguistic indicators of Siegel and McKeown (2000), who show that (some of) these features correlate with either stative or dynamic verb types. We parse the AFE and XIE sections of Gigaword [13] with the Stanford dependency parser. For each verb type, we obtain a normalized count showing how often it occurs with each of the indicators in Table 4, resulting in one value per feature per verb. For example, for the verb fill, the value of the feature temporal-adverb is 0.0085, meaning that 0.85% of the occurrences of fill in the corpus are modified by one of the temporal adverbs on the list compiled by Siegel (1998b). Tense, progressive, perfect and voice are extracted using a set of rules following Loaiciga et al. (2014); we thank the authors for providing us their code.
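As an illustration of the normalization, the sketch below computes one value per indicator per verb type from a stream of parsed verb occurrences. The input format, the function name `indicator_values` and the toy counts are assumptions made for this example; only the resulting ratio for fill mirrors the figure quoted above.

```python
from collections import Counter, defaultdict


def indicator_values(observations):
    """Relative frequency of each indicator per verb type.

    observations: iterable of (verb_lemma, indicators) pairs, one pair per
    verb occurrence, where indicators is the set of indicator names that
    fire for that occurrence (hypothetical input format).
    """
    totals = Counter()
    hits = defaultdict(Counter)
    for lemma, indicators in observations:
        totals[lemma] += 1
        for name in set(indicators):
            hits[lemma][name] += 1
    return {
        lemma: {name: n / totals[lemma] for name, n in hits[lemma].items()}
        for lemma in totals
    }


# Toy data: 85 of 10000 occurrences of "fill" carry a temporal adverb.
toy = [("fill", {"temporal-adverb"})] * 85 + [("fill", set())] * 9915
values = indicator_values(toy)
```

Here `values["fill"]["temporal-adverb"]` is 85/10000, i.e. the 0.85% from the example; indicators that never fire for a verb are simply absent and would be treated as 0.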
Feature | Example |
---|---|
frequency | - |
present | says |
past | said |
future | will say |
perfect | had won |
progressive | is winning |
negated | not/never |
particle | up/in/… |
no subject | - |
continuous adverb | continually, endlessly |
evaluation adverb | better, horribly |
manner adverb | furiously, patiently |
temporal adverb | again, finally |
in-PP | in an hour |
for-PP | for an hour |
Feature | Values |
---|---|
part-of-speech tag of the verb | VB, VBG, VBN, … |
tense | present, past, future |
progressive | true/false |
perfect | true/false |
voice | active/passive |
grammatical dependents | WordNet lexname/POS |
We aim to leverage existing, possibly noisy sets of representative stative, dynamic or mixed verb types extracted from LCS (see section 3), making up for unseen verbs and noise by averaging over distributional similarities. Using an existing large distributional model [31] estimated over the set of Gigaword documents marked as stories, for each verb type, we build a syntactically informed vector representing the contexts in which the verb occurs. We compute three numeric feature values per verb type, which correspond to the average cosine similarities with the verb types in each of the three seed sets.
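The three distributional feature values amount to an average of cosine similarities per seed set. The sketch below illustrates this with sparse context vectors stored as dictionaries; the vector format, the toy vectors, the seed lists and the choice to skip the target verb when it appears in a seed set are all assumptions for illustration, not details taken from the model in [31].

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {context: weight}."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def dist_features(verb, vectors, seed_sets):
    """Average similarity of `verb` to the verbs of each seed set."""
    features = {}
    for name, seeds in seed_sets.items():
        sims = [
            cosine(vectors[verb], vectors[s])
            for s in seeds
            if s != verb and s in vectors  # skipping the verb itself is an assumption
        ]
        features[name] = sum(sims) / len(sims) if sims else 0.0
    return features


# Toy vectors and seed sets (invented for this example).
vectors = {
    "know": {"subj:person": 1.0, "obj:fact": 1.0},
    "run": {"subj:person": 1.0, "adv:quickly": 1.0},
    "believe": {"subj:person": 1.0, "obj:fact": 1.0},
}
seed_sets = {"stative": ["know"], "dynamic": ["run"], "mixed": []}
feats = dist_features("believe", vectors, seed_sets)
```

Because the features are averages over whole seed sets, a single mislabeled seed verb shifts the value only slightly, which is the motivation for averaging given above.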
Table 5 shows our set of instance-based syntactic and semantic features. In contrast to the above described type-based features, these features do not rely on a background corpus, but are extracted from the clause being classified. Tense, progressive, perfect and voice are extracted from dependency parses as described above. For features encoding grammatical dependents, we focus on a subset of grammatical relations. The feature value is either the WordNet lexical filename (e.g. noun.person) of the given relation’s argument or its POS tag, if the former is not available. We simply use the most frequent sense for the dependent’s lemma. We also include features that indicate, if there are any, the particle of the verb and its prepositional dependents. For the sentence A little girl had just finished her first week of school, the instance-based feature values would include tense:past, subj:noun.person, dobj:noun.time or particle:none.
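The fallback from WordNet lexical filename to POS tag can be sketched as follows. The lookup table `lexnames` is a toy stand-in for a most-frequent-sense WordNet lookup, and the function name and input format are hypothetical; the example reproduces the feature values listed for the sentence above.

```python
def instance_features(tense, dependents, lexnames, particle=None):
    """Build instance-based features for one clause.

    dependents: list of (relation, lemma, pos_tag) triples for the main verb.
    lexnames: mapping from lemma to WordNet lexical filename (toy stand-in).
    """
    features = {"tense": tense, "particle": particle or "none"}
    for relation, lemma, pos_tag in dependents:
        # Use the lexical filename if known, otherwise fall back to the POS tag.
        features[relation] = lexnames.get(lemma, pos_tag)
    return features


lexnames = {"girl": "noun.person", "week": "noun.time"}  # toy lookup
feats_girl = instance_features(
    "past", [("subj", "girl", "NN"), ("dobj", "week", "NN")], lexnames
)
feats_pron = instance_features("present", [("subj", "it", "PRP")], lexnames)
```

For the pronoun subject in the second call no lexical filename is available, so the feature value is the POS tag `PRP`.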
Features | Accuracy (%) |
---|---|
Baseline (Lemma) | |
LingInd | |
Inst | |
Inst+Lemma | |
Dist | |
LingInd+Inst+Dist+Lemma |
The experiments presented in this section aim to evaluate the effectiveness of the feature sets described in the previous section, focusing on the challenging cases of verb types unseen in the training data and highly ambiguous verbs. The feature Lemma indicates that the verb’s lemma is used as an additional feature.
The setting of our first experiment follows Siegel and McKeown (2000). Table 6 reports results for 10-fold cross-validation, with occurrences of all verbs distributed evenly over the folds. No feature combination significantly outperforms the baseline of simply memorizing the most frequent class of a verb type in the respective training folds (significance according to McNemar's test with Yates' correction for continuity).
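For reference, McNemar's test with Yates' continuity correction compares two classifiers on the same test items using only the items on which they differ. The sketch below uses the standard chi-square form with one degree of freedom; the counts in the usage example are invented and are not results from our experiments.

```python
import math


def mcnemar_yates(b, c):
    """McNemar's test with Yates' continuity correction.

    b: items classified correctly by system A only.
    c: items classified correctly by system B only.
    Returns (chi-square statistic, p-value) for one degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    # The corrected difference is clamped at zero so that b == c gives chi2 = 0.
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: A alone is right on 25 items, B alone on 10.
chi2, p_value = mcnemar_yates(25, 10)
```

Items that both systems get right or both get wrong do not enter the statistic, which is why two systems with similar accuracy can still differ significantly, or not at all.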
This experiment shows a successful case of semi-supervised learning: while type-based feature values can be estimated from large corpora in an unsupervised way, some labeled training data is necessary to learn their best combination. This experiment specifically examines performance on verbs not seen in labeled training data. We use 10-fold cross validation but ensure that all occurrences of a verb type appear in the same fold: verb types in each test fold have not been seen in the respective training data, ruling out the Lemma feature. A Logistic regression classifier [14] works better here (using only numeric features), and we present results in Table 7. Both the LingInd and Dist features generalize across verb types, and their combination works best.
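The fold constraint for this experiment can be illustrated with a small grouping routine that keeps all occurrences of a verb type in one fold. The greedy size balancing is an assumption made for this sketch; the text above only requires that verb types do not cross folds.

```python
from collections import defaultdict


def verb_type_folds(lemmas, k=10):
    """Assign a fold index to each instance so that no lemma spans two folds."""
    by_lemma = defaultdict(list)
    for idx, lemma in enumerate(lemmas):
        by_lemma[lemma].append(idx)
    folds = [0] * len(lemmas)
    sizes = [0] * k
    # Place the most frequent verb types first, each into the smallest fold.
    for lemma in sorted(by_lemma, key=lambda l: (-len(by_lemma[l]), l)):
        target = sizes.index(min(sizes))
        for idx in by_lemma[lemma]:
            folds[idx] = target
        sizes[target] += len(by_lemma[lemma])
    return folds


example_lemmas = ["say", "fill", "say", "know", "fill", "run"]
example_folds = verb_type_folds(example_lemmas, k=3)
```

With such an assignment, every verb type in a test fold is unseen in the corresponding training folds, so a Lemma feature carries no information.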
# | Features | Accuracy (%) |
---|---|---|
1 | Baseline | |
2 | Dist | |
3 | LingInd | |
4 | LingInd+Dist |
Data | Features | Acc. (%) |
---|---|---|
one-label verbs (1966 inst.) | Baseline | |
 | LingInd | |
 | Dist | |
 | Inst+Lemma | |
 | LingInd+Inst+Lemma | |
multi-label verbs (4195 inst.) | Baseline | |
 | LingInd | |
 | Dist | |
 | Inst | |
 | Inst+Lemma | |
 | LingInd+Inst+Lemma | |
 | LingInd+Inst+Lemma+Dist | |
System | Class | Acc. | P | R | F |
---|---|---|---|---|---|
baseline | micro-avg. | | 0.75 | 0.79 | 0.76 |
LingInd+Inst+Lemma | dynamic | | 0.84 | 0.95 | 0.89 |
 | stative | | 0.76 | 0.69 | 0.72 |
 | both | | 0.51 | 0.24 | 0.33 |
 | micro-avg. | | 0.78 | 0.81 | 0.79 |
For this experiment, we compute results separately for one-label verbs (those for which all instances in Asp-MASC have the same label) and for multi-label verbs (instances have differing labels in Asp-MASC). We expect one-label verbs to have a strong predominant aspectual class, and multi-label verbs to be more flexible. Otherwise, the experimental setup is as in experiment 1. Results appear in Table 8. In each case, the linguistic indicator features again perform on par with the baseline. For multi-label verbs, the feature combination Lemma+LingInd+Inst leads to a significant gain of 2% in accuracy over the baseline; Table 9 reports detailed class statistics and reveals a gain in F-measure of 3 points over the baseline. To sum up, Inst features are essential for classifying multi-label verbs, and the LingInd features provide some useful prior. These results motivate the need for an instance-based approach.
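The split into one-label and multi-label verbs depends only on the set of labels observed per verb type. A minimal sketch, with a hypothetical function name and invented toy instances:

```python
from collections import defaultdict


def split_by_label_variety(instances):
    """Split (lemma, label) instances into one-label and multi-label verbs."""
    labels = defaultdict(set)
    for lemma, label in instances:
        labels[lemma].add(label)
    one_label = [x for x in instances if len(labels[x[0]]) == 1]
    multi_label = [x for x in instances if len(labels[x[0]]) > 1]
    return one_label, multi_label


toy_instances = [
    ("fill", "stative"),
    ("fill", "dynamic"),
    ("kiss", "dynamic"),
    ("kiss", "dynamic"),
]
one_label, multi_label = split_by_label_variety(toy_instances)
```

Note that the split is defined over the whole labeled corpus, so a verb counts as multi-label as soon as a single instance carries a different label.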
Verb | # of inst. | Majority class (%) | Majority class | Inst+Lemma | Inst+Lemma+LingInd+Dist |
---|---|---|---|---|---|
feel | 128 | 96.1 | stat | | |
say | 138 | 94.9 | dyn | 93.5 | 93.5 |
make | 136 | 91.9 | dyn | 91.9 | 91.2 |
come | 133 | 88.0 | dyn | 87.2 | 87.2 |
take | 137 | 85.4 | dyn | 85.4 | 85.4 |
meet | 130 | 83.9 | dyn | 86.2 | 87.7 |
stand | 130 | 80.0 | stat | 79.2 | 83.1 |
find | 137 | 74.5 | dyn | 69.3 | 68.8 |
accept | 134 | 70.9 | dyn | 64.9 | 65.7 |
hold | 134 | 56.0 | both | 43.3 | 49.3 |
carry | 136 | 55.9 | dyn | 55.9 | 58.1 |
look | 138 | 55.8 | dyn | 72.5 | 74.6 |
show | 133 | 54.9 | dyn | 69.2 | 68.4 |
appear | 136 | 52.2 | stat | 64.7 | 61.0 |
follow | 122 | 51.6 | both | 69.7 | 65.6 |
consider | 138 | 50.7 | dyn | 61.6 | 70.3 |
cover | 123 | 50.4 | stat | 46.3 | 54.5 |
fill | 134 | 47.8 | dyn | 66.4 | 62.7 |
bear | 135 | 47.4 | dyn | 70.4 | 67.4 |
allow | 135 | 37.8 | dyn | 48.9 | 51.9 |
micro-avg. | 2667 | 66.3 | | 71.0* | 72.0* |
For verbs with ambiguous aspectual class, type-based classification is not sufficient, as this approach selects a dominant sense for any given verb and then always assigns that. Therefore we propose handling ambiguous verbs separately. As Asp-MASC contains only a few instances of each of the ambiguous verbs, we turn to the Asp-Ambig dataset. We perform a Leave-One-Out (LOO) cross validation evaluation, with results reported in Table 10. (The third column of Table 10 also shows the outcome of using either only the Lemma, only LingInd or only Dist in LOO; all have almost the same outcome as using the majority class, and the numbers differ only after the decimal point.) Using the Inst features alone (not shown in Table 10) results in a micro-average accuracy of only 58.1%: these features are only useful when combined with the feature Lemma. For classifying verbs whose most frequent class occurs less than 56% of the time, Lemma+Inst features are essential. Whether or not performance is improved by adding LingInd/Dist features, with their bias towards one aspectual class, depends on the verb type. It is an open research question which verb types should be treated in which way.
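The micro-averaged majority-class figure in Table 10 is an instance-weighted mean of the per-verb percentages, which can be checked directly from the table rows. The sketch below copies the instance counts and majority-class percentages from Table 10; the variable names are illustrative.

```python
# (verb, number of instances, majority-class percentage) from Table 10.
rows = [
    ("feel", 128, 96.1), ("say", 138, 94.9), ("make", 136, 91.9),
    ("come", 133, 88.0), ("take", 137, 85.4), ("meet", 130, 83.9),
    ("stand", 130, 80.0), ("find", 137, 74.5), ("accept", 134, 70.9),
    ("hold", 134, 56.0), ("carry", 136, 55.9), ("look", 138, 55.8),
    ("show", 133, 54.9), ("appear", 136, 52.2), ("follow", 122, 51.6),
    ("consider", 138, 50.7), ("cover", 123, 50.4), ("fill", 134, 47.8),
    ("bear", 135, 47.4), ("allow", 135, 37.8),
]

total_instances = sum(n for _, n, _ in rows)
# Weight each verb's percentage by its number of instances.
micro_avg = sum(n * pct for _, n, pct in rows) / total_instances
```

The weighted mean comes out at about 66.3 over 2667 instances, matching the micro-average row; an unweighted mean over verbs would give each verb equal influence regardless of how many instances survived parsing.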
We have described a new, context-aware approach to automatically predicting aspectual class, including a new set of distributional features. We have also introduced two new data sets of clauses labeled for aspectual class. Our experiments show that in any setting where labeled training data is available, improvement over the most frequent class baseline can only be reached by integrating instance-based features, though type-based features (LingInd, Dist) may provide useful priors for some verbs and successfully predict predominant aspectual class for unseen verb types. In order to arrive at a globally well-performing system, we envision a multi-stage approach, treating verbs differently according to whether training data is available and whether or not the verb’s aspectual class distribution is highly skewed.
We thank the anonymous reviewers, Omri Abend, Mike Lewis, Manfred Pinkal, Mark Steedman, Stefan Thater and Bonnie Webber for helpful comments, and our annotators A. Kirkland and R. Kühn. This research was supported in part by the MMCI Cluster of Excellence, and the first author is supported by an IBM PhD Fellowship.