Automatic prediction of aspectual class of verbs in context

Annemarie Friedrich and Alexis Palmer
Department of Computational Linguistics
Saarland University, Saarbrücken, Germany
{afried,apalmer}@coli.uni-saarland.de

Abstract

This paper describes a new approach to predicting the aspectual class of verbs in context, i.e., whether a verb is used in a stative or dynamic sense. We identify two challenging cases of this problem: when the verb is unseen in training data, and when the verb is ambiguous for aspectual class. A semi-supervised approach using linguistically-motivated features and a novel set of distributional features based on representative verb types allows us to predict classes accurately, even for unseen verbs. Many frequent verbs can be either stative or dynamic in different contexts, which has not been modeled by previous work; we use contextual features to resolve this ambiguity. In addition, we introduce two new datasets of clauses marked for aspectual class.

1 Introduction

In this work, we focus on the automatic prediction of whether a verb in context is used in a stative or in a dynamic sense, the most fundamental distinction in all taxonomies of aspectual class. The aspectual class of a discourse’s finite verbs is an important factor in conveying and interpreting temporal structure [21, 10, 18]; others are tense, grammatical aspect, mood and whether the utterance represents an event as completed. More accurate temporal information processing is expected to be beneficial for a variety of natural language processing tasks [7, 32].

While most verbs have one predominant interpretation, others are more flexible for aspectual class and can occur as either stative (1) or dynamic (2) depending on the context. There are also cases that allow for both readings, such as (3).

(1) The liquid fills the container. (stative)

(2) The pool slowly filled with water. (dynamic)

(3) Your soul was made to be filled with God Himself. (both) (Brown corpus, religion)

Cases like (3) do not imply that there is a third class, but rather that two interpretations are available for the sentence, of which usually one will be chosen by a reader.

Following Siegel and McKeown (2000), we aim to automatically classify clauses for fundamental aspectual class, a function of the main verb and a select group of complements, which may differ per verb [26, 28]. This corresponds to the aspectual class of the clause’s main verb when ignoring any aspectual markers or transformations. For example, English sentences with perfect tense are usually considered to introduce states to the discourse [29, 17], but we are interested in the aspectual class before this transformation takes place. The clause John has kissed Mary introduces a state, but the fundamental aspectual class of the ‘tenseless’ clause John kiss Mary is dynamic.

In contrast to Siegel and McKeown (2000), we do not conduct the task of predicting aspectual class solely at the type level, as such an approach ignores the minority class of ambiguous verbs. Instead we predict the aspectual class of verbs in the context of their arguments and modifiers. We show that this method works better than using only type-based features, especially for verbs with ambiguous aspectual class. In addition, we show that type-based features, including novel distributional features based on representative verbs, accurately predict predominant aspectual class for unseen verb types. Our work differs from prior work in that we treat the problem as a three-way classification task, predicting dynamic, stative or both as the aspectual class of a verb in context.

2 Related work

Aspectual class is well treated in the linguistic literature (for example, [33, 12, 29]). Our notion of the stative/dynamic distinction corresponds to Bach’s [1] distinction between states and non-states; to states versus occurrences (events and processes) according to Mourelatos (1978); and to Vendler’s [33] distinction between states and the other three classes (activities, achievements, accomplishments).

Early studies on the computational modeling of aspectual class [23, 24, 5, 18] laid foundations for a cluster of papers published over a decade ago [26, 28, 27]. Since then, it has mostly been treated as a subtask within temporal reasoning, such as in efforts related to TimeBank [25] and the TempEval challenges [34, 35, 32], where top-performing systems [16, 3, 6] use corpus-based features, WordNet synsets, parse paths and features from typed dependencies to classify events as a joint task with determining the event’s span. Costa and Branco (2012) explore the usefulness of a wider range of explicitly aspectual features for temporal relation classification.

Siegel and McKeown (2000) present the most extensive study of predicting aspectual class, which is the main inspiration for this work. While all of their linguistically motivated features (see section 4.1) are type-based, they train on and evaluate over labeled verbs in context. Their data set, taken from medical discharge summaries, comprises 1500 clauses containing main verbs other than be and have which are marked for aspectual class. Their model fails to outperform a baseline of memorizing the most frequent class of a verb type, and they present an experiment testing on unseen verb types only for the related task of classifying completedness of events. We replicate their method using publicly available software, create a similar but larger corpus,¹ and show that it is indeed possible to predict the aspectual class of unseen verbs. Siegel (1998a) investigates a classification method for the verb have in context; inspired by this work, our present work goes one step further and uses a larger set of instance-based contextual features to perform experiments on a set of 20 verbs. To the best of our knowledge, there is no previous work comprehensively addressing aspectual classification of verbs in context.

¹ Direct comparison on their data is not possible; feature values for the verbs studied are available, but full texts and the English Slot Grammar parser [20] are not.

3 Data

Verb type seed sets

Using the LCS Database [11], we identify sets of verb types whose senses are only stative (188 verbs, e.g. belong, cost, possess), only dynamic (3760 verbs, e.g. alter, knock, resign), or mixed (215 verbs, e.g. fill, stand, take), following a procedure described by Dorr and Olsen (1997).

Asp-MASC

genre	clauses (complete)	$\kappa$ (complete)	clauses (w/o have/be/none)	$\kappa$ (w/o have/be/none)
jokes	3462	0.85	2660	0.77
letters	1848	0.71	1444	0.62
news	2565	0.79	2075	0.69
all	7875	0.80	6161	0.70

Table 1: Asp-MASC: Cohen’s observed unweighted $\kappa$.

	dynamic	stative	both
dynamic	4464	164	9
stative	434	1056	29
both	5	0	0

Table 2: Asp-MASC: confusion matrix for two annotators, without have/be/none clauses; $\kappa$ is 0.7.

The Asp-MASC corpus consists of 7875 clauses from the letters, news and jokes sections of MASC [15], each labeled by two annotators for the aspectual class of the main verb.² Texts were segmented into clauses using SPADE [30] with some heuristic post-processing. We parse the corpus using the Stanford dependency parser [8] and extract the main verb of each segment. We use 6161 clauses for the classification task, omitting clauses with have or be as the main verb and those where no main verb could be identified due to parsing errors (none). Table 1 shows inter-annotator agreement; Table 2 shows the confusion matrix for the two annotators. Our two annotators exhibit different preferences on the 598 cases where they disagree between dynamic and stative. Such differences in annotation preferences are not uncommon [2]. We observe higher agreement in the jokes and news subcorpora than for letters; texts in the letters subcorpus are largely argumentative and thus have a different rhetorical style than the more straightforward narratives and reports found in jokes. Overall, we find substantial agreement.

² Corpus freely available from www.coli.uni-saarland.de/~afried.
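The agreement figure for Table 2 can be recomputed from the confusion matrix itself. The following is a minimal sketch of unweighted Cohen's $\kappa$ in plain Python; the function name `cohens_kappa` is ours, and the counts are copied from Table 2. Which annotator sits on which axis is not stated in the table, but $\kappa$ is unaffected by transposing the matrix.

```python
def cohens_kappa(confusion):
    """Unweighted Cohen's kappa from a square confusion matrix.

    Rows index one annotator's labels, columns the other's.
    """
    total = sum(sum(row) for row in confusion)
    # Observed agreement: proportion of items on the diagonal.
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    # Chance agreement: product of the two annotators' marginal proportions.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Counts copied from Table 2; label order: dynamic, stative, both.
table2 = [
    [4464, 164, 9],
    [434, 1056, 29],
    [5, 0, 0],
]
kappa = cohens_kappa(table2)
# Off-diagonal dynamic/stative cells: the disagreements discussed in the text.
dyn_stat_disagreements = table2[0][1] + table2[1][0]
print(f"kappa = {kappa:.2f}, dynamic/stative disagreements = {dyn_stat_disagreements}")
```

With these counts the value rounds to the 0.7 given in the caption of Table 2, and the two off-diagonal dynamic/stative cells add up to the 598 disagreements mentioned above.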

The data for our experiments uses the label dynamic or stative whenever annotators agree, and both whenever they disagree or when at least one annotator marked the clause as both, assuming that both readings are possible in such cases. Because we do not want to model the authors’ personal view of the theory, we refrain from applying an adjudication step and model the data as is.
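The label-merging rule just described is simple enough to state as a function. This is an illustrative sketch only; the function name `gold_label` and the string label names are our own choices.

```python
def gold_label(label_a, label_b):
    """Merge two annotators' labels into the label used for the experiments.

    Agreement on 'dynamic' or 'stative' keeps that label; any disagreement,
    or any 'both' from either annotator, yields 'both'.
    """
    if label_a == label_b and label_a in ("dynamic", "stative"):
        return label_a
    return "both"


print(gold_label("dynamic", "dynamic"))
print(gold_label("dynamic", "stative"))
```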

Asp-Ambig: (Brown)

	dynamic	stative	both
dynamic	1444	201	54
stative	168	697	20
both	44	31	8

Table 3: Asp-Ambig: confusion matrix for two annotators. Cohen’s $\kappa$ is 0.6.

In order to facilitate a first study on ambiguous verbs, we select 20 frequent verbs from the list of ‘mixed’ verbs (see section 3) and mark 138 sentences for each. Sentences are extracted randomly from the Brown corpus, such that the distribution of stative/dynamic usages is expected to be natural. We present entire sentences to the annotators, who mark the aspectual class of the verb in question as highlighted in the sentence. The data is processed in the same way as Asp-MASC, discarding instances with parsing problems. This results in 2667 instances. Cohen’s $\kappa$ is 0.6; the confusion matrix is shown in Table 3. Per-verb details are listed in Table 10.

4 Model and Features

For predicting the aspectual class of verbs in context (stative, dynamic, both), we assume a supervised learning setting and explore features mined from a large background corpus, distributional features, and instance-based features. If not indicated otherwise, experiments use a Random Forest classifier [4] trained with the implementation and standard parameter settings from Weka [14].

4.1 Linguistic indicator features (LingInd)

This set of corpus-based features is a reimplementation of the linguistic indicators of Siegel and McKeown (2000), who show that (some of) these features correlate with either stative or dynamic verb types. We parse the AFE and XIE sections of Gigaword [13] with the Stanford dependency parser. For each verb type, we obtain a normalized count showing how often it occurs with each of the indicators in Table 4, resulting in one value per feature per verb. For example, for the verb fill, the value of the feature temporal-adverb is 0.0085, meaning that 0.85% of the occurrences of fill in the corpus are modified by one of the temporal adverbs on the list compiled by Siegel (1998b). Tense, progressive, perfect and voice are extracted using a set of rules following Loaiciga et al. (2014).³

³ We thank the authors for providing us their code.
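To make the normalization concrete, here is a minimal sketch of how per-verb indicator rates could be computed once each parsed verb occurrence has been tagged with the indicators it exhibits. The input format, the function name `indicator_rates` and the toy data are assumptions for illustration; they are not the pipeline used in the paper.

```python
from collections import Counter, defaultdict


def indicator_rates(occurrences):
    """Compute, per verb lemma, the fraction of occurrences showing each indicator.

    occurrences: iterable of (lemma, set_of_indicator_names) pairs, one per
    verb token in the background corpus (hypothetical input format).
    Indicators never observed with a lemma are simply absent (rate 0).
    """
    verb_counts = Counter()
    indicator_counts = defaultdict(Counter)
    for lemma, indicators in occurrences:
        verb_counts[lemma] += 1
        for name in indicators:
            indicator_counts[lemma][name] += 1
    return {
        lemma: {name: n / verb_counts[lemma]
                for name, n in indicator_counts[lemma].items()}
        for lemma in verb_counts
    }


# Toy data: four occurrences of 'fill', one of them with a temporal adverb.
toy = [
    ("fill", {"past", "temporal-adverb"}),
    ("fill", {"present"}),
    ("fill", {"past"}),
    ("fill", {"progressive"}),
]
rates = indicator_rates(toy)
print(rates["fill"]["temporal-adverb"])
```

On the real corpus the same ratio for fill and temporal-adverb is the 0.0085 quoted above; the toy data here is far too small to reproduce that figure.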

Feature	Example	Feature	Example
frequency	-	continuous adverb	continually, endlessly
present	says	evaluation adverb	better, horribly
past	said	manner adverb	furiously, patiently
future	will say	temporal adverb	again, finally
perfect	had won	in-PP	in an hour
progressive	is winning	for-PP	for an hour
negated	not/never		
particle	up/in/…		
no subject	-		

Table 4: LingInd feature set and examples for lexical items associated with each indicator.

Feature	Values
part-of-speech tag of the verb	VB, VBG, VBN, …
tense	present, past, future
progressive	true/false
perfect	true/false
voice	active/passive
grammatical dependents	WordNet lexname/POS

Table 5: Instance-based (Inst) features

4.2 Distributional Features (Dist)

We aim to leverage existing, possibly noisy sets of representative stative, dynamic or mixed verb types extracted from LCS (see section 3), making up for unseen verbs and noise by averaging over distributional similarities. Using an existing large distributional model [31] estimated over the set of Gigaword documents marked as stories, for each verb type, we build a syntactically informed vector representing the contexts in which the verb occurs. We compute three numeric feature values per verb type, which correspond to the average cosine similarities with the verb types in each of the three seed sets.
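A minimal sketch of the three distributional features follows, assuming each verb type is already represented as a sparse context vector. The dictionary-based vectors, the dimension names and the function names are hypothetical stand-ins; the actual vectors come from the distributional model of [31].

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {dimension: weight} dicts."""
    dot = sum(w * v.get(dim, 0.0) for dim, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dist_features(verb, vectors, seed_sets):
    """Average cosine similarity of `verb` to the verbs in each seed set."""
    feats = {}
    for name, seeds in seed_sets.items():
        sims = [cosine(vectors[verb], vectors[s]) for s in seeds if s in vectors]
        feats["sim_" + name] = sum(sims) / len(sims) if sims else 0.0
    return feats


# Toy vectors with made-up syntactic context dimensions.
vectors = {
    "own": {"subj:person": 1.0, "dobj:artifact": 1.0},
    "possess": {"subj:person": 1.0, "dobj:artifact": 1.0},
    "belong": {"subj:artifact": 1.0, "prep_to:person": 1.0},
    "knock": {"subj:person": 1.0, "prep_on:artifact": 1.0},
    "fill": {"subj:substance": 1.0, "dobj:artifact": 1.0},
}
seed_sets = {"stative": ["belong", "possess"], "dynamic": ["knock"], "mixed": ["fill"]}
feats = dist_features("own", vectors, seed_sets)
print(feats)
```

Averaging over a whole seed set, rather than taking the nearest seed verb, is what lets individual noisy seed entries wash out.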

4.3 Instance-based features (Inst)

Table 5 shows our set of instance-based syntactic and semantic features. In contrast to the type-based features described above, these features do not rely on a background corpus, but are extracted from the clause being classified. Tense, progressive, perfect and voice are extracted from dependency parses as described above. For features encoding grammatical dependents, we focus on a subset of grammatical relations. The feature value is either the WordNet lexical filename (e.g. noun.person) of the given relation’s argument or its POS tag, if the former is not available. We simply use the most frequent sense for the dependent’s lemma. We also include features that indicate, if there are any, the particle of the verb and its prepositional dependents. For the sentence A little girl had just finished her first week of school, the instance-based feature values would include tense:past, subj:noun.person, dobj:noun.time and particle:none.
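The lexname-or-POS fallback can be sketched as follows for the example sentence. The clause dictionary, the small `LEXNAMES` table standing in for a WordNet most-frequent-sense lookup, and the choice of relations (including `advmod`) are all assumptions made for illustration; the paper does not list the exact relation inventory.

```python
# Hypothetical stand-in for looking up the WordNet lexical filename of the
# most frequent sense of a lemma; a real system would query WordNet instead.
LEXNAMES = {"girl": "noun.person", "week": "noun.time"}


def instance_features(clause, lexnames):
    """Build the instance-based feature dict for one parsed clause."""
    feats = {
        "pos": clause["verb_pos"],
        "tense": clause["tense"],
        "progressive": clause["progressive"],
        "perfect": clause["perfect"],
        "voice": clause["voice"],
        "particle": clause.get("particle") or "none",
    }
    for relation, (lemma, pos_tag) in clause["dependents"].items():
        # Use the lexical filename if available, otherwise fall back to the POS tag.
        feats[relation] = lexnames.get(lemma, pos_tag)
    return feats


# "A little girl had just finished her first week of school."
clause = {
    "verb_pos": "VBN",
    "tense": "past",
    "progressive": False,
    "perfect": True,
    "voice": "active",
    "particle": None,
    "dependents": {
        "subj": ("girl", "NN"),
        "dobj": ("week", "NN"),
        "advmod": ("just", "RB"),  # no lexname in the toy table: falls back to POS
    },
}
feats = instance_features(clause, LEXNAMES)
print(feats)
```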

Features	Accuracy (%)
Baseline (Lemma)	$83.6$
LingInd	$83.8$
Inst	$70.8$
Inst+Lemma	$83.7$
Dist	$83.4$
LingInd+Inst+Dist+Lemma	$84.1$

Table 6: Experiment 1: Seen verbs, using Asp-MASC. Baseline memorizes most frequent class per verb type in training folds.

5 Experiments

The experiments presented in this section aim to evaluate the effectiveness of the feature sets described in the previous section, focusing on the challenging cases of verb types unseen in the training data and highly ambiguous verbs. The feature Lemma indicates that the verb’s lemma is used as an additional feature.

Experiment 1: Seen verbs

The setting of our first experiment follows Siegel and McKeown (2000). Table 6 reports results for 10-fold cross-validation, with occurrences of all verbs distributed evenly over the folds. No feature combination significantly⁴ outperforms the baseline of simply memorizing the most frequent class of a verb type in the respective training folds.

⁴ According to McNemar’s test with Yates’ correction for continuity, $p<0.01$.
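The baseline and the significance test used in this experiment can both be sketched in a few lines of plain Python. The function names are ours, and the fallback to the overall most frequent class for a lemma missing from training is an assumption (in this experiment every verb is seen in training). The p-value uses the fact that a chi-square variable with one degree of freedom exceeds $x$ with probability $\mathrm{erfc}(\sqrt{x/2})$.

```python
import math
from collections import Counter, defaultdict


def train_lemma_baseline(train):
    """Memorize the most frequent class per verb lemma from (lemma, label) pairs."""
    per_verb = defaultdict(Counter)
    overall = Counter()
    for lemma, label in train:
        per_verb[lemma][label] += 1
        overall[label] += 1
    table = {lemma: c.most_common(1)[0][0] for lemma, c in per_verb.items()}
    fallback = overall.most_common(1)[0][0]  # assumption: used for unknown lemmas
    return table, fallback


def mcnemar_yates(gold, pred_a, pred_b):
    """McNemar's test with Yates' continuity correction; returns (statistic, p)."""
    # b: system A right, system B wrong; c: system A wrong, system B right.
    b = sum(1 for g, a, o in zip(gold, pred_a, pred_b) if a == g and o != g)
    c = sum(1 for g, a, o in zip(gold, pred_a, pred_b) if a != g and o == g)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival, 1 d.o.f.
    return stat, p_value


train = [("say", "dynamic"), ("say", "dynamic"), ("own", "stative"),
         ("fill", "dynamic"), ("fill", "dynamic"), ("fill", "stative")]
table, fallback = train_lemma_baseline(train)
print(table["fill"], fallback)
```

A comparison is then reported as significant when the returned p-value is below 0.01, matching the threshold in footnote 4.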

Experiment 2: Unseen verbs

This experiment shows a successful case of semi-supervised learning: while type-based feature values can be estimated from large corpora in an unsupervised way, some labeled training data is necessary to learn their best combination. This experiment specifically examines performance on verbs not seen in labeled training data. We use 10-fold cross-validation but ensure that all occurrences of a verb type appear in the same fold: verb types in each test fold have not been seen in the respective training data, ruling out the Lemma feature. A logistic regression classifier [14] works better here (using only numeric features), and we present results in Table 7. Both the LingInd and Dist features generalize across verb types, and their combination works best.
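The fold construction for this setting can be sketched as follows. How verb types were actually distributed over the folds is not specified, so the round-robin assignment over sorted lemmas below is an assumption, as is the function name `verb_type_folds`.

```python
def verb_type_folds(instances, n_folds=10):
    """Split instances so that all occurrences of a verb type share one fold.

    instances: list of tuples whose first element is the verb lemma.
    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    lemmas = sorted({inst[0] for inst in instances})
    # Assumption: round-robin assignment of verb types to folds.
    fold_of = {lemma: i % n_folds for i, lemma in enumerate(lemmas)}
    splits = []
    for k in range(n_folds):
        test = [i for i, inst in enumerate(instances) if fold_of[inst[0]] == k]
        train = [i for i, inst in enumerate(instances) if fold_of[inst[0]] != k]
        splits.append((train, test))
    return splits


instances = [("fill", "stative"), ("fill", "dynamic"), ("say", "dynamic"),
             ("own", "stative"), ("take", "dynamic"), ("take", "both")]
splits = verb_type_folds(instances, n_folds=3)
for train_idx, test_idx in splits:
    print(sorted({instances[i][0] for i in test_idx}))
```

Because the split is made over verb types rather than instances, no lemma in a test fold can appear in the corresponding training data, which is why the Lemma feature carries no information here.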

	Features	Accuracy (%)
1	Baseline	$72.5$
2	Dist	$78.3^{*}$
3	LingInd	$80.4^{*}$
4	LingInd+Dist	$81.9^{*\dagger}$

Table 7: Experiment 2: Unseen verb types, logistic regression, Asp-MASC. Baseline labels everything with the most frequent class in the training set (dynamic). *Significantly⁴ different from line 1. $\dagger$Significantly⁴ different from line 3.

Data	Features	Acc. (%)
one-label verbs (1966 inst.)	Baseline	$92.8$
	LingInd	$92.8$
	Dist	$92.6$
	Inst+Lemma	$91.4^{*}$
	LingInd+Inst+Lemma	$92.4$
multi-label verbs (4195 inst.)	Baseline	$78.9$
	LingInd	$79.0$
	Dist	$79.0$
	Inst	$67.4^{*}$
	Inst+Lemma	$79.9$
	LingInd+Inst+Lemma	$80.9^{*}$
	LingInd+Inst+Lemma+Dist	$80.2^{*}$

Table 8: Experiment 3: ‘one- vs. multi-label’ verbs, Asp-MASC. Baseline as in Table 6. *Indicates that result is significantly⁴ different from the respective baseline.

System	Class	Acc.	P	R	F
baseline	micro-avg.	$78.9$	0.75	0.79	0.76
LingInd+Inst+Lemma	dynamic		0.84	0.95	0.89
	stative		0.76	0.69	0.72
	both		0.51	0.24	0.33
	micro-avg.	$80.9^{*}$	0.78	0.81	0.79

Table 9: Experiment 3: ‘multi-label’, precision, recall and F-measure, detailed class statistics for the best-performing system from Table 8.

Experiment 3: one- vs. multi-label verbs

For this experiment, we compute results separately for one-label verbs (those for which all instances in Asp-MASC have the same label) and for multi-label verbs (instances have differing labels in Asp-MASC). We expect one-label verbs to have a strong predominant aspectual class, and multi-label verbs to be more flexible. Otherwise, the experimental setup is as in experiment 1. Results appear in Table 8. In each case, the linguistic indicator features again perform on par with the baseline. For multi-label verbs, the feature combination Lemma+LingInd+Inst leads to a significant⁴ gain of 2% in accuracy over the baseline; Table 9 reports detailed class statistics and reveals a gain in F-measure of 3 points over the baseline. To sum up, Inst features are essential for classifying multi-label verbs, and the LingInd features provide a useful prior. These results motivate the need for an instance-based approach.
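The one-label versus multi-label partition follows directly from the definition above. A minimal sketch, with a hypothetical function name and toy data:

```python
from collections import defaultdict


def split_by_label_variety(instances):
    """Partition (lemma, label) instances into one-label and multi-label verbs."""
    labels_seen = defaultdict(set)
    for lemma, label in instances:
        labels_seen[lemma].add(label)
    one_label = [x for x in instances if len(labels_seen[x[0]]) == 1]
    multi_label = [x for x in instances if len(labels_seen[x[0]]) > 1]
    return one_label, multi_label


toy = [("say", "dynamic"), ("say", "dynamic"), ("fill", "stative"),
       ("fill", "dynamic"), ("own", "stative")]
one_label, multi_label = split_by_label_variety(toy)
print(len(one_label), len(multi_label))
```

Note that the partition is made at the level of verb types, so every instance of a verb ends up in the same subset.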

Experiment 4: Instance-based classification

Verb	# of inst.	Majority class⁵ (%)	Majority class⁵ (label)	Inst+Lemma	Inst+Lemma+LingInd+Dist
feel	128	96.1	stat		
say	138	94.9	dyn	93.5	93.5
make	136	91.9	dyn	91.9	91.2
come	133	88.0	dyn	87.2	87.2
take	137	85.4	dyn	85.4	85.4
meet	130	83.9	dyn	86.2	87.7
stand	130	80.0	stat	79.2	83.1
find	137	74.5	dyn	69.3	68.8
accept	134	70.9	dyn	64.9	65.7
hold	134	56.0	both	43.3	49.3
carry	136	55.9	dyn	55.9	58.1
look	138	55.8	dyn	72.5	74.6
show	133	54.9	dyn	69.2	68.4
appear	136	52.2	stat	64.7	61.0
follow	122	51.6	both	69.7	65.6
consider	138	50.7	dyn	61.6	70.3
cover	123	50.4	stat	46.3	54.5
fill	134	47.8	dyn	66.4	62.7
bear	135	47.4	dyn	70.4	67.4
allow	135	37.8	dyn	48.9	51.9
micro-avg.	2667	66.3		71.0*	72.0*

Table 10: Experiment 4: Instance-based. Accuracy (in %) on Asp-Ambig. *Differs significantly⁴ from the majority class baseline.

For verbs with ambiguous aspectual class, type-based classification is not sufficient, as this approach selects a dominant sense for any given verb and then always assigns that. Therefore we propose handling ambiguous verbs separately. As Asp-MASC contains only few instances of each of the ambiguous verbs, we turn to the Asp-Ambig dataset. We perform a Leave-One-Out (LOO) cross-validation evaluation, with results reported in Table 10.⁵ Using the Inst features alone (not shown in Table 10) results in a micro-average accuracy of only 58.1%: these features are only useful when combined with the feature Lemma. For classifying verbs whose most frequent class occurs less than 56% of the time, Lemma+Inst features are essential. Whether or not performance is improved by adding LingInd/Dist features, with their bias towards one aspectual class, depends on the verb type. It is an open research question which verb types should be treated in which way.

⁵ The third column also shows the outcome of using either only the Lemma, only LingInd or only Dist in LOO; all have almost the same outcome as using the majority class, with numbers differing only after the decimal point.
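As a worked check on Table 10, the micro-averaged majority-class accuracy is the instance-weighted mean of the per-verb majority percentages. The sketch below copies the instance counts and majority-class percentages from Table 10; the variable names are ours.

```python
# (number of instances, majority-class share in %) per verb, from Table 10.
majority = {
    "feel": (128, 96.1), "say": (138, 94.9), "make": (136, 91.9),
    "come": (133, 88.0), "take": (137, 85.4), "meet": (130, 83.9),
    "stand": (130, 80.0), "find": (137, 74.5), "accept": (134, 70.9),
    "hold": (134, 56.0), "carry": (136, 55.9), "look": (138, 55.8),
    "show": (133, 54.9), "appear": (136, 52.2), "follow": (122, 51.6),
    "consider": (138, 50.7), "cover": (123, 50.4), "fill": (134, 47.8),
    "bear": (135, 47.4), "allow": (135, 37.8),
}

total_instances = sum(n for n, _ in majority.values())
# Weight each verb's accuracy by its number of instances (micro-average).
micro_avg = sum(n * acc for n, acc in majority.values()) / total_instances
print(total_instances, round(micro_avg, 1))
```

The weighted mean agrees with the 66.3% micro-average in the last row of the table, and the instance counts sum to the 2667 instances of Asp-Ambig.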

6 Discussion and conclusions

We have described a new, context-aware approach to automatically predicting aspectual class, including a new set of distributional features. We have also introduced two new data sets of clauses labeled for aspectual class. Our experiments show that in any setting where labeled training data is available, improvement over the most frequent class baseline can only be reached by integrating instance-based features, though type-based features (LingInd, Dist) may provide useful priors for some verbs and successfully predict predominant aspectual class for unseen verb types. In order to arrive at a globally well-performing system, we envision a multi-stage approach, treating verbs differently according to whether training data is available and whether or not the verb’s aspectual class distribution is highly skewed.

Acknowledgments

We thank the anonymous reviewers, Omri Abend, Mike Lewis, Manfred Pinkal, Mark Steedman, Stefan Thater and Bonnie Webber for helpful comments, and our annotators A. Kirkland and R. Kühn. This research was supported in part by the MMCI Cluster of Excellence, and the first author is supported by an IBM PhD Fellowship.

References

[1] E. Bach (1986) The algebra of events. Linguistics and philosophy 9 (1), pp. 5–16.
[2] B. Beigman Klebanov, E. Beigman and D. Diermeier (2008) Analyzing disagreements. pp. 2–7.
[3] S. Bethard (2013) ClearTK-TimeML: a minimalist approach to TempEval 2013. Vol. 2, pp. 10–14.
[4] L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
[5] M. R. Brent (1991) Automatic semantic classification of verbs from their syntactic contexts: an implemented classifier for stativity. pp. 222–226.
[6] N. Chambers (2013) Navytime: Event and time ordering from raw text. Vol. 2, pp. 73–77.
[7] F. Costa and A. Branco (2012) Aspectual type and temporal relation classification. pp. 266–275.
[8] M. De Marneffe, B. MacCartney and C. D. Manning (2006) Generating typed dependency parses from phrase structure parses. Vol. 6, pp. 449–454.
[9] B. J. Dorr and M. B. Olsen (1997) Deriving verbal and compositional lexical aspect for NLP applications. pp. 151–158.
[10] B. J. Dorr (1992) A two-level knowledge representation for machine translation: lexical semantics and tense/aspect. Lexical Semantics and Knowledge Representation, pp. 269–287.
[11] B. J. Dorr (2001) LCS verb database. Online Software Database of Lexical Conceptual Structures, University of Maryland, College Park, MD.
[12] D. Dowty (1979) Word meaning and Montague grammar. Reidel, Dordrecht.
[13] Gigaword corpus.
[14] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten (2009) The Weka data mining software: an update. ACM SIGKDD explorations newsletter 11 (1), pp. 10–18.
[15] N. Ide, C. Fellbaum, C. Baker and R. Passonneau (2010) The manually annotated sub-corpus: a community resource for and by the people. pp. 68–73.
[16] H. Jung and A. Stent (2013) ATT1: Temporal annotation using big windows and rich syntactic and semantic features. Vol. 2, pp. 20–24.
[17] G. Katz (2003) On the stativity of the English perfect. Perfect explorations, pp. 205–234.
[18] J. L. Klavans and M. Chodorow (1992) Degrees of stativity: the lexical representation of verb aspect. pp. 1126–1131.
[19] S. Loaiciga, T. Meyer and A. Popescu-Belis (2014) English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling.
[20] M. C. McCord (1990) Slot grammar. Springer.
[21] M. Moens and M. J. Steedman (1988) Temporal ontology and temporal reference. Computational Linguistics 14 (2), pp. 15–28.
[22] A. P.D. Mourelatos (1978) Events, processes, and states. Linguistics and philosophy 2 (3), pp. 415–434.
[23] A. Nakhimovsky (1988) Aspect, aspectual class, and the temporal structure of narrative. Computational Linguistics 14 (2), pp. 29–43.
[24] R. Passonneau (1988) A computational model of the semantics of tense and aspect. Computational Linguistics Spring 1988.
[25] J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day and L. Ferro (2003) The TimeBank corpus. Vol. 2003, pp. 40.
[26] E. V. Siegel and K. R. McKeown (2000) Learning methods to combine linguistic indicators: improving aspectual classification and revealing linguistic insights. Computational Linguistics 26 (4), pp. 595–628.
[27] E. V. Siegel (1998a) Disambiguating verbs with the WordNet category of the direct object. Universite de Montreal.
[28] E. V. Siegel (1998b) Linguistic indicators for language understanding: using machine learning methods to combine corpus-based indicators for aspectual classification of clauses. Ph.D. Thesis, Columbia University.
[29] C. S. Smith (1991) The parameter of aspect. Kluwer, Dordrecht.
[30] R. Soricut and D. Marcu (2003) Sentence level discourse parsing using syntactic and lexical information. pp. 149–156.
[31] S. Thater, H. Fürstenau and M. Pinkal (2011) Word meaning in context: a simple and effective vector model. pp. 1134–1143.
[32] N. UzZaman, H. Llorens, L. Derczynski, M. Verhagen, J. Allen and J. Pustejovsky (2013) SemEval-2013 task 1: TempEval-3: evaluating time expressions, events, and temporal relations. Vol. 2, pp. 1–9.
[33] Z. Vendler (1957) Linguistics in philosophy. pp. 97–121.
[34] M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, G. Katz and J. Pustejovsky (2007) SemEval-2007 task 15: TempEval temporal relation identification. pp. 75–80.
[35] M. Verhagen, R. Sauri, T. Caselli and J. Pustejovsky (2010) SemEval-2010 task 13: TempEval-2. pp. 57–62.
