A Step-wise Usage-based Method for Inducing
Polysemy-aware Verb Classes

Daisuke Kawahara†, Daniel W. Peterson‡, Martha Palmer‡

†Kyoto University, Kyoto, Japan
‡University of Colorado at Boulder, Boulder, CO, USA
dk@i.kyoto-u.ac.jp, {Daniel.W.Peterson, Martha.Palmer}@colorado.edu

Abstract

We present an unsupervised method for inducing verb classes from verb uses in giga-word corpora. Our method consists of two clustering steps: verb-specific semantic frames are first induced by clustering verb uses in a corpus and then verb classes are induced by clustering these frames. By taking this step-wise approach, we can not only generate verb classes based on a massive amount of verb uses in a scalable manner, but also deal with verb polysemy, which is bypassed by most of the previous studies on verb clustering. In our experiments, we acquire semantic frames and verb classes from two giga-word corpora, the larger comprising 20 billion words. The effectiveness of our approach is verified through quantitative evaluations based on polysemy-aware gold-standard data.

1 Introduction

A verb plays a primary role in conveying the meaning of a sentence. Capturing the sense of a verb is essential for natural language processing (NLP), and thus lexical resources for verbs play an important role in NLP.

Verb classes are one such lexical resource. Manually-crafted verb classes have been developed, such as Levin’s classes [16] and their extension, VerbNet [12], in which verbs are organized into classes on the basis of their syntactic and semantic behavior. Such verb classes have been used in many NLP applications that need to consider semantics in particular, such as word sense disambiguation [4], semantic parsing [41, 33] and discourse parsing [37].

There have also been many attempts to automatically acquire verb classes with the goal of either adding frequency information to an existing resource or of inducing similar verb classes for other languages. Most of these approaches assume that all target verbs are monosemous [36, 32, 9, 18, 38, 39, 45, 26, 27, 7, 19, 29, 40]. This monosemous assumption, however, is not realistic because many frequent verbs actually have multiple senses. Moreover, to the best of our knowledge, none of the following approaches attempt to quantitatively evaluate soft clusterings of verb classes induced by polysemy-aware unsupervised approaches [14, 15, 17, 31].

In this paper, we propose an unsupervised method for inducing verb classes that is aware of verb polysemy. Our method consists of two clustering steps: verb-specific semantic frames are first induced by clustering verb uses in a corpus and then verb classes are induced by clustering these frames. By taking this step-wise approach, we can not only induce verb classes with frequency information from a massive amount of verb uses in a scalable manner, but also deal with verb polysemy.

Our novel contributions are summarized as follows:

• induce both semantic frames and verb classes from a massive amount of verb uses by a scalable method,
• explicitly deal with verb polysemy,
• discover effective features for each of the clustering steps, and
• quantitatively evaluate a soft clustering of verbs.

Figure 1: Overview of our two-step approach. Verb-specific semantic frames are first induced from verb uses (lower part) and then verb classes are induced from the semantic frames (upper part). The labels of verb classes are manually assigned here for better understanding.

2 Related Work

As stated in Section 1, most of the previous studies on verb clustering assume that verbs are monosemous. A typical method in these studies is to represent each verb as a single data point and apply classification (e.g., Joanis et al. (2008)) or clustering (e.g., Sun and Korhonen (2009)) to these data points. As a representation for a data point, distributions of subcategorization frames are often used, and other semantic features (e.g., selectional preferences) are sometimes added to improve the performance.

Among these studies on monosemous verb clustering (i.e., predominant class induction), there have been several Bayesian methods. Vlachos et al. (2009) proposed a Dirichlet process mixture model (DPMM; Neal (2000)) to cluster verbs based on subcategorization frame distributions. They evaluated their result with a gold-standard test set, where a single class is assigned to a verb. Parisien and Stevenson (2010) proposed a hierarchical Dirichlet process (HDP; Teh et al. (2006)) model to jointly learn argument structures (subcategorization frames) and verb classes by using syntactic features. Parisien and Stevenson (2011) extended their model by adding semantic features. They tried to account for verb learning by children and did not evaluate the resultant verb classes. Modi et al. (2012) extended the model of Titov and Klementiev (2012), which is an unsupervised model for inducing semantic roles, to jointly induce semantic roles and frames across verbs using the Chinese Restaurant Process [1]. All of the above methods considered verbs to be monosemous and did not deal with verb polysemy. Our approach also uses Bayesian methods, but is designed to capture verb polysemy.

We summarize a few studies that consider polysemy of verbs in the rest of this section.

Miyao and Tsujii (2009) proposed a supervised method that can handle verb polysemy. Their method represents a verb’s syntactic and semantic features, and learns a log-linear model from the SemLink corpus [20]. Boleda et al. (2007) also proposed a supervised method for Catalan adjectives considering the polysemy of adjectives.

The most closely related work to our polysemy-aware task of unsupervised verb class induction is the work of Korhonen et al. (2003), who used distributions of subcategorization frames to cluster verbs. They adopted the Nearest Neighbor (NN) and Information Bottleneck (IB) methods for clustering. In particular, they tried to consider verb polysemy by using the IB method, which is a soft clustering method [43]. However, the verb itself is still represented as a single data point. After performing soft clustering, they noted that most verbs fell into a single class, and they decided to assign a single class to each verb by hardening the clustering. They considered multiple classes only in the gold-standard data used for their evaluations. We also evaluate our induced verb classes on this gold-standard data, which was created on the basis of Levin’s classes [16].

Lapata and Brew (2004) and Li and Brew (2007) proposed probabilistic models for calculating prior probabilities of verb classes for a verb. These models are approximated to condition not on verbs but on subcategorization frames. As mentioned in Li and Brew (2007), it is desirable to extend the model to depend on verbs to further improve accuracy. They conducted several evaluations including predominant class induction and token-level verb sense disambiguation, but did not evaluate multiple classes output by their models. Schulte im Walde et al. (2008) also applied probabilistic soft clustering to verbs by incorporating subcategorization frames and selectional preferences based on WordNet. This model is based on the Expectation-Maximization algorithm and the Minimum Description Length principle. Since they focused on the incorporation of selectional preferences, they did not evaluate verb classes but evaluated only selectional preferences using a language model-based measure.

Materna proposed LDA-frames, which are defined across verbs and can be considered to be a kind of verb class [21, 22]. LDA-frames are probabilistic semantic frames automatically induced from a raw corpus. He used a model based on latent Dirichlet allocation (LDA; Blei et al. (2003)) and the Dirichlet process to cluster verb instances of a triple (subject, verb, object) to produce semantic frames and roles. Both of these are represented as a probabilistic distribution of words across verbs. He applied this method to the BNC and acquired 1,200 frames and 400 roles [21]. He did not evaluate the resulting frames as verb classes.

In sum, there have been no studies that quantitatively evaluate polysemous verb classes automatically induced by unsupervised methods.

3 Our Approach

3.1 Overview

Our objective is to automatically learn semantic frames and verb classes from a massive amount of verb uses following usage-based approaches. Although Bayesian approaches of the kind used in previous studies could in principle induce frames and verb classes simultaneously from a corpus, they have a prohibitive computational cost. For instance, Parisien and Stevenson applied HDP only to a small-scale child speech corpus that contains 170K verb uses to jointly induce subcategorization frames and verb classes [26, 27]. Materna applied an LDA-based method to the BNC, which contains 1.4M verb uses, to induce semantic frames across verbs that can be considered to be verb classes [21, 22]. However, this experiment would take three months on this 100-million-word corpus.¹ Although it is best to use the largest possible corpus for this kind of knowledge acquisition task [30], it is infeasible to scale to giga-word corpora using such joint models.

¹ In our replication experiment, it took a week to perform 70 iterations using Materna’s code and an Intel Xeon E5-2680 (2.7GHz) CPU. To reach 1,000 iterations, which are reported to be optimum, it would take three months.

In this paper, we propose a two-step approach for inducing semantic frames and verb classes. First, we make multiple data points for each verb to deal with verb polysemy (cf. polysemy-aware previous studies still represented a verb as one data point [14, 23]). To do that, we induce verb-specific semantic frames by clustering verb uses. Then, we induce verb classes by clustering these verb-specific semantic frames across verbs. An interesting point here is that we can use exactly the same method for these two clustering steps.

Our procedure to automatically induce verb classes from verb uses is summarized as follows:

1. induce verb-specific semantic frames by clustering predicate-argument structures for each verb extracted from automatic parses as shown in the lower part of Figure 1, and
2. induce verb classes by clustering the induced semantic frames across verbs as shown in the upper part of Figure 1.

Each of these two steps is described in the following sections in detail.

3.2 Inducing Verb-specific Semantic Frames

We induce verb-specific semantic frames from verb uses based on the method of Kawahara et al. (2014). Our semantic frames consist of case slots, each of which consists of word instances that can be filled. The procedure for inducing these semantic frames is as follows:

1. apply dependency parsing to a raw corpus and extract predicate-argument structures for each verb from the automatic parses,
2. merge the predicate-argument structures that have presumably the same meaning based on the assumption of one sense per collocation [46] to get a set of initial frames, and
3. apply clustering to the initial frames based on the Chinese Restaurant Process [1] to produce verb-specific semantic frames.

These three steps are briefly described below.

3.2.1 Extracting Predicate-argument Structures from a Raw Corpus

We apply dependency parsing to a large raw corpus. We use the Stanford parser with Stanford dependencies [5].² Collapsed dependencies are adopted to directly extract prepositional phrases.

² http://nlp.stanford.edu/software/lex-parser.shtml

Then, we extract predicate-argument structures from the dependency parses. Dependents that have the following dependency relations to a verb are extracted as arguments:

nsubj, xsubj, dobj, iobj, ccomp, xcomp, prep_*

In this process, the verb and arguments are lemmatized, and only the head of an argument is preserved for compound nouns.

Predicate-argument structures are collected for each verb and the subsequent processes are applied to the predicate-argument structures of each verb.
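As an illustration of the extraction step, the following minimal sketch filters dependency triples by the relations listed above and groups the surviving dependents under their verb. The input format (already lemmatized `(relation, verb, dependent)` triples whose governors are verbs) and the function names are assumptions made for this example; they are not the output format of the Stanford parser.

```python
from collections import defaultdict

# Relations treated as verb arguments; prep_* is matched by prefix.
ARG_RELATIONS = {"nsubj", "xsubj", "dobj", "iobj", "ccomp", "xcomp"}


def is_argument_relation(rel):
    """Return True if a dependency relation marks a verb argument."""
    return rel in ARG_RELATIONS or rel.startswith("prep_")


def extract_pas(dependencies):
    """Group argument dependents by governing verb token.

    `dependencies` is a hypothetical list of
    (relation, (verb_lemma, verb_index), dependent_lemma) triples for one sentence.
    Returns {(verb_lemma, verb_index): [(slot, word), ...]}.
    """
    pas = defaultdict(list)
    for rel, verb, dep_lemma in dependencies:
        if is_argument_relation(rel):
            pas[verb].append((rel, dep_lemma))
    return dict(pas)


# "The child watched a bird in the park" (lemmatized, collapsed dependencies)
deps = [
    ("nsubj", ("watch", 3), "child"),
    ("dobj", ("watch", 3), "bird"),
    ("prep_in", ("watch", 3), "park"),
    ("det", ("bird", 5), "a"),  # not an argument relation, so it is dropped
]
example_pas = extract_pas(deps)
```

Here `example_pas` holds a single predicate-argument structure for "watch" with three case slots.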

3.2.2 Constructing Initial Frames from Predicate-argument Structures

To make the computation feasible, we merge the predicate-argument structures that have the same or similar meaning to get initial frames. These initial frames are the input of the subsequent clustering process. For this merge, we assume one sense per collocation [46] for predicate-argument structures.

For each predicate-argument structure of a verb, we couple the verb and an argument to make a unit for sense disambiguation. We select an argument in the following order by considering the degree of effect on the verb sense:³

dobj, ccomp, nsubj, prep_*, iobj.

³ If a predicate-argument structure has multiple prepositional phrases, one of them is randomly selected.

Then, the predicate-argument structures that have the same verb and argument pair (slot and word, e.g., “dobj:effect”) are merged into an initial frame. After this process, we discard minor initial frames that occur fewer than 10 times.
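The merging procedure can be sketched as follows, assuming each predicate-argument structure of a single verb is given as a list of `(slot, word)` pairs. The function names and data layout are hypothetical; structures that contain none of the prioritized slots are simply skipped here, which is a choice of this sketch rather than something the text specifies.

```python
import random
from collections import Counter, defaultdict

# Selection order of the argument coupled with the verb; "prep" stands for prep_*.
SLOT_PRIORITY = ["dobj", "ccomp", "nsubj", "prep", "iobj"]
MIN_COUNT = 10  # initial frames seen fewer times than this are discarded


def select_key_argument(args, rng=random):
    """Pick the (slot, word) pair used as the sense-disambiguation unit."""
    for slot in SLOT_PRIORITY:
        if slot == "prep":
            candidates = [a for a in args if a[0].startswith("prep_")]
            if candidates:
                # Several prepositional phrases: pick one at random, as the text states.
                return rng.choice(candidates)
        else:
            candidates = [a for a in args if a[0] == slot]
            if candidates:
                return candidates[0]
    return None


def build_initial_frames(pas_list, min_count=MIN_COUNT, rng=random):
    """Merge the structures of ONE verb that share the same key argument.

    Returns {key: Counter of "slot:word" features}, keeping only keys that
    occur at least `min_count` times.
    """
    frames = defaultdict(Counter)
    sizes = Counter()
    for args in pas_list:
        key = select_key_argument(args, rng)
        if key is None:
            continue
        sizes[key] += 1
        frames[key].update(f"{slot}:{word}" for slot, word in args)
    return {k: f for k, f in frames.items() if sizes[k] >= min_count}
```

Each returned `Counter` has the same shape as the instances clustered in the next step, so the initial frames can be passed on without conversion.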

3.2.3 Clustering Method

We cluster initial frames for each verb to produce semantic frames using the Chinese Restaurant Process [1], regarding each initial frame as an instance.

We calculate the posterior probability of a cluster $c_{j}$ given an initial frame $f_{i}$ as follows:

$$P(c_{j}|f_{i}) \propto \begin{cases} \frac{n(c_{j})}{N+\alpha} \cdot P(f_{i}|c_{j}) & c_{j} \neq \mathrm{new} \\ \frac{\alpha}{N+\alpha} \cdot P(f_{i}|c_{j}) & c_{j} = \mathrm{new}, \end{cases} \qquad (1)$$

where $N$ is the number of initial frames for the target verb and $n(c_{j})$ is the current number of initial frames assigned to the cluster $c_{j}$. $\alpha$ is a hyper-parameter that determines how likely it is for a new cluster to be created. In this equation, the first term is the Dirichlet process prior and the second term is the likelihood of $f_{i}$.

$P(f_{i}|c_{j})$ is defined based on the Dirichlet-Multinomial distribution as follows:

$$P(f_{i}|c_{j}) = \prod_{w \in V} P(w|c_{j})^{count(f_{i},w)}, \qquad (2)$$

where $V$ is the vocabulary in all case slots cooccurring with the verb and $count(f_{i},w)$ is the number of $w$ in the initial frame $f_{i}$. The original method in Kawahara et al. (2014) defined $w$ as pairs of slots and words, e.g., “nsubj:child” and “dobj:bird,” but did not consider slot-only features, e.g., “nsubj” and “dobj,” which ignore lexical information. Here we experiment with both representations and compare the results.

$P(w|c_{j})$ is defined as follows:

$$P(w|c_{j}) = \frac{count(c_{j},w)+\beta}{\sum_{t \in V} count(c_{j},t) + |V| \cdot \beta}, \qquad (3)$$

where $count(c_{j},w)$ is the current number of $w$ in the cluster $c_{j}$, and $\beta$ is a hyper-parameter of the Dirichlet distribution. For a new cluster, this probability is uniform ($1/|V|$).

We regard each output cluster as a semantic frame, by merging the initial frames in a cluster into a semantic frame. In this way, semantic frames for each verb are acquired.

We use Gibbs sampling to realize this clustering.
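To make equations (1) to (3) concrete, the following is a minimal sketch of one possible Gibbs sampler over frames represented as feature counters. It is an illustration under simplifying assumptions, not the implementation used in our experiments: every frame starts in its own cluster, the sampler returns the last sampled assignment, and all function and variable names are hypothetical.

```python
import math
import random
from collections import Counter


def log_likelihood(frame, cluster_counts, cluster_total, vocab_size, beta):
    """log P(f_i | c_j) following equations (2) and (3)."""
    denom = cluster_total + vocab_size * beta
    return sum(
        n * math.log((cluster_counts.get(w, 0) + beta) / denom)
        for w, n in frame.items()
    )


def gibbs_crp(frames, alpha=1.0, beta=1.0, iterations=100, seed=0):
    """Cluster `frames` (a list of Counters over features) with a CRP prior.

    Returns a list with one cluster id per frame.
    """
    rng = random.Random(seed)
    vocab_size = len({w for f in frames for w in f})
    n_frames = len(frames)
    assign = list(range(n_frames))  # assumption: start with singleton clusters
    counts = {i: Counter(f) for i, f in enumerate(frames)}
    sizes = {i: 1 for i in range(n_frames)}
    next_id = n_frames

    for _ in range(iterations):
        for i, frame in enumerate(frames):
            # Remove frame i from its current cluster.
            old = assign[i]
            counts[old].subtract(frame)
            sizes[old] -= 1
            if sizes[old] == 0:
                del counts[old], sizes[old]

            # Score every existing cluster plus one new cluster, equation (1).
            labels = list(sizes) + [next_id]
            log_scores = []
            for c in labels:
                if c == next_id:
                    prior = alpha / (n_frames + alpha)
                    ll = log_likelihood(frame, {}, 0, vocab_size, beta)  # uniform 1/|V|
                else:
                    prior = sizes[c] / (n_frames + alpha)
                    total = sum(counts[c].values())
                    ll = log_likelihood(frame, counts[c], total, vocab_size, beta)
                log_scores.append(math.log(prior) + ll)

            # Sample a cluster in proportion to the unnormalized posterior.
            top = max(log_scores)
            weights = [math.exp(s - top) for s in log_scores]
            new = rng.choices(labels, weights=weights)[0]
            if new == next_id:
                counts[new], sizes[new] = Counter(), 0
                next_id += 1
            counts[new].update(frame)
            sizes[new] += 1
            assign[i] = new
    return assign
```

Working in log space avoids underflow when a frame contains many word instances; the shift by the maximum score before exponentiation leaves the sampling distribution unchanged.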

3.3 Inducing Verb Classes from Semantic Frames

To induce verb classes across verbs, we apply clustering to the induced verb-specific semantic frames. We can use exactly the same clustering method as described in Section 3.2.3 by using semantic frames for multiple verbs as an input instead of initial frames for a single verb. This is because an initial frame has the same structure as a semantic frame, which is produced by merging initial frames. We regard each output cluster as a verb class this time.

For the features, $w$, in equation (2), we try the two representations again: slot-only features and slot-word pair features. The representation using only slots corresponds to the consideration of only syntactic argument patterns. The other representation using the slot-word pairs means that semantic similarity based on word overlap is naturally considered by looking at lexical information. We will compare in our experiments four possible combinations: two feature representations for each of the two clustering steps.
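The relation between the two representations can be illustrated with a short sketch that collapses slot-word pair counts into slot-only counts. The `"slot:word"` string encoding and the function name are assumptions of this example.

```python
from collections import Counter


def to_slot_only(frame):
    """Collapse "slot:word" counts into slot-only counts (the "S" representation)."""
    slots = Counter()
    for feature, n in frame.items():
        slot = feature.split(":", 1)[0]
        slots[slot] += n
    return slots


# A frame in the slot-word pair ("SW") representation.
frame_sw = Counter({"nsubj:child": 3, "nsubj:man": 2, "dobj:bird": 5})
frame_s = to_slot_only(frame_sw)
```

In this example the two subject fillers are merged into a single "nsubj" count, so two frames with different words but the same argument pattern become indistinguishable under the slot-only representation.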

4 Experiments and Evaluations

We first describe our experimental settings and define evaluation metrics to evaluate induced soft clusterings of verb classes. Then, we conduct type-level multi-class evaluations, type-level single-class evaluations and token-level multi-class evaluations. These two levels of evaluations are performed by considering the work of Reichart et al. (2010) on clustering evaluation. Finally, we discuss the results of our full experiments.

4.1 Experimental Settings

We use two kinds of large-scale corpora: a web corpus and the English Gigaword corpus.

To prepare a web corpus, we extracted sentences from crawled web pages that are judged to be written in English based on the encoding information. Then, we selected sentences that consist of at most 40 words, and removed duplicated sentences. From this process, we obtained a corpus of one billion sentences, totaling approximately 20 billion words. We focused on verbs whose frequency in the web corpus was more than 1,000. There were 19,649 such verbs, counting phrasal verbs and treating passive and active constructions as separate verbs. We extracted 2,032,774,982 predicate-argument structures.

We also used the English Gigaword corpus (LDC2011T07; English Gigaword Fifth Edition). This corpus consists of approximately 180 million sentences, totaling four billion words. There were 7,356 verbs after applying the same frequency threshold as the web corpus. We extracted 423,778,278 predicate-argument structures from this corpus.

We set the hyper-parameters $\alpha$ in (1) and $\beta$ in (3) to 1.0. The cluster assignments for all the components were initialized randomly. We took 100 samples for each input frame and selected the cluster assignment that has the highest probability.

4.2 Evaluation Metrics

To measure the precision and recall of a clustering, modified purity and inverse purity (also called collocation or weighted class accuracy) are commonly used in previous studies on verb clustering (e.g., Sun and Korhonen (2009)). However, since these measures are only applicable to a hard clustering, it is necessary to extend them to be applicable to a soft clustering, because in our task a verb can belong to multiple clusters or classes.⁴ We propose a normalized version of modified purity and inverse purity. This kind of normalization for soft clusterings was performed for other evaluation metrics as in Springorum et al. (2013).

⁴ Korhonen et al. (2003) evaluated hard clusterings based on a gold standard with multiple classes per verb. They reported only precision measures including modified purity, and avoided extending the evaluation metrics for soft clusterings.

To measure the precision of a clustering, a normalized version of modified purity is defined as follows. Suppose $K$ is the set of automatically induced clusters and $G$ is the set of gold classes. Let $K_{i}$ be the verb vector of the $i$-th cluster and $G_{j}$ be the verb vector of the $j$-th gold class. Each component of these vectors is a normalized frequency, which equals a cluster/class attribute probability given a verb. Where there is no frequency information available for class distribution, such as the gold-standard data described in Section 4.3, we use a uniform distribution across the verb’s classes. The core idea of purity is that each cluster $K_{i}$ is associated with its most prevalent gold class. In addition, to penalize clusters that consist of only one verb, such singleton clusters in $K$ are considered as errors, as is usual with modified purity. The normalized modified purity (nmPU) can then be written as follows:

$$\textrm{nmPU} = \frac{1}{N} \sum_{i\ \textrm{s.t.}\ |K_{i}|>1} \max_{j} \delta_{K_{i}}(K_{i} \cap G_{j}), \qquad (4)$$

$$\delta_{K_{i}}(K_{i} \cap G_{j}) = \sum_{v \in K_{i} \cap G_{j}} c_{iv}, \qquad (5)$$

where $N$ denotes the total number of verbs, $|K_{i}|$ denotes the number of positive components in $K_{i}$, and $c_{iv}$ denotes the $v$-th component of $K_{i}$. $\delta_{K_{i}}({K_{i}\cap G_{j}})$ means the total mass of the set of verbs in $K_{i}\cap G_{j}$, given by summing up the values in $K_{i}$. When evaluating a hard clustering, this is equal to $|K_{i}\cap G_{j}|$ because all the values of $c_{iv}$ are equal to 1.

As usual, the following normalized inverse purity (niPU) is used to measure the recall of a clustering:

$$\textrm{niPU} = \frac{1}{N} \sum_{j} \max_{i} \delta_{G_{j}}(K_{i} \cap G_{j}). \qquad (6)$$

Finally, we use the harmonic mean (F$_1$) of nmPU and niPU as a single measure of clustering quality.
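The metrics in equations (4) to (6) can be sketched as follows, assuming clusters and gold classes are stored as dictionaries that map each verb to its normalized weight. The helper names and the toy data (four verbs with classes taken from Table 1, and an invented soft clustering into clusters "A" and "B") are for illustration only.

```python
from collections import defaultdict


def uniform_vectors(verb_to_labels):
    """Turn {verb: [label, ...]} into {label: {verb: 1/len(labels)}}."""
    vectors = defaultdict(dict)
    for verb, labels in verb_to_labels.items():
        for label in labels:
            vectors[label][verb] = 1.0 / len(labels)
    return dict(vectors)


def overlap_mass(weights, other):
    """delta: total mass in `weights` of the verbs that also occur in `other`."""
    return sum(w for v, w in weights.items() if v in other)


def nmpu(clusters, gold, n_verbs):
    """Normalized modified purity, equation (4)."""
    total = 0.0
    for k in clusters.values():
        if len(k) > 1:  # singleton clusters count as errors
            total += max(overlap_mass(k, g) for g in gold.values())
    return total / n_verbs


def nipu(clusters, gold, n_verbs):
    """Normalized inverse purity, equation (6)."""
    total = sum(
        max(overlap_mass(g, k) for k in clusters.values()) for g in gold.values()
    )
    return total / n_verbs


def f1(p, r):
    """Harmonic mean of precision-like and recall-like scores."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy example: uniform weights over each verb's classes, as for the gold standard.
gold = uniform_vectors({"place": ["9"], "lay": ["9"],
                        "build": ["26", "45"], "bake": ["26", "45"]})
induced = uniform_vectors({"place": ["A"], "lay": ["A"],
                           "build": ["A", "B"], "bake": ["B"]})
precision = nmpu(induced, gold, n_verbs=4)
recall = nipu(induced, gold, n_verbs=4)
score = f1(precision, recall)
```

In this toy case cluster "A" is matched to class 9 with mass 2 and cluster "B" to class 26 with mass 1.5, so nmPU is 3.5/4 = 0.875, while every gold class is fully recovered by some cluster and niPU is 1.0.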

verb	classes	verb	classes
place	9	drop	9, 45, 004, 47, 51, A54, A30
dye	24, 21, 41		
focus	31, 45	bake	26, 45
stare	30	persuade	002
lay	9	sparkle	43
build	26, 45	pour	9, 43, 26, 57, 13, 31
force	002, 11		
glow	43	invent	26, 27

Table 1: An excerpt of the gold-standard verb classes for several verbs from Korhonen et al. (2003). The classes starting with ‘0’ were derived from the LCS database, those starting with ‘A’ were defined by Korhonen et al., and the other classes were from Levin’s classes. A bolded class is the predominant class for each verb.

method	K	nmPU	niPU	F$_1$
IB ($k$=35, $t$=0.10)	35.0	53.59	51.44	52.44
IB ($k$=35, $t$=0.05)	35.0	53.67	52.62	53.10
IB ($k$=35, $t$=0.02)	35.0	54.42	54.43	54.40
IB ($k$=35, $t$=0.01)	35.0	54.60	55.54	55.04
IB ($k$=42, $t$=0.10)	41.6	55.42	49.46	52.24
IB ($k$=42, $t$=0.05)	41.8	55.55	49.97	52.59
IB ($k$=42, $t$=0.02)	42.0	56.19	51.24	53.58
IB ($k$=42, $t$=0.01)	42.0	56.80	51.92	54.24
LDA-frames ($t$=0.10)	100	47.52	56.83	51.76
LDA-frames ($t$=0.05)	165	50.46	67.94	57.91
LDA-frames ($t$=0.02)	306	49.98	75.50	60.14
LDA-frames ($t$=0.01)	458	49.55	82.71	61.97
Gigaword/S-S	272.8	63.46	67.66	65.49
Gigaword/S-SW	36.4	31.49	95.70	47.38
Gigaword/SW-S	186.2	63.52	64.18	63.84
Gigaword/SW-SW	30.0	36.27	94.66	52.40
web/S-S	363.6	61.32	78.64	68.90
web/S-SW	52.2	35.80	99.30	52.62
web/SW-S	212.2	66.26	77.38	71.39
web/SW-SW	55.0	36.70	96.25	53.13

Table 2: Type-level multi-class evaluations. K represents the (average) number of induced classes. “S” denotes the use of slot-only features and “SW” denotes the use of slot-word pair features. For example, “SW-S” means that slot-word pair features are used for semantic frame induction and slot-only features are used for verb class induction.

4.3 Type-level Multi-class Evaluations

We first evaluate our induced verb classes on the test set created by Korhonen et al. (2003) (Table 1 of their paper) which was created by considering verb polysemy on the basis of Levin’s classes and the LCS database [6]. It consists of 62 classes and 110 verbs, out of which 35 verbs are monosemous and 75 verbs are polysemous. The average number of verb classes per verb is 2.24. An excerpt from this data is shown in Table 1.

method	K	mPU (predominant)	iPU (predominant)	F$_1$ (predominant)	mPU (multiple)	niPU (multiple)	F$_1$ (multiple)
NN	24	46.36	52.73	49.34	52.73	46.85	49.62
IB ($k$=35)	34.8	42.73	51.82	46.82	51.64	46.83	49.09
IB ($k$=42)	41.0	47.45	50.91	49.11	55.27	45.45	49.87
LDA-frames	53	30.00	47.27	36.71	41.82	44.28	43.01
Gigaword/S	9.6	25.64	71.27	37.70	32.91	64.71	43.62
Gigaword/SW	10.6	30.36	71.09	42.25	39.82	66.92	49.70
web/S	20.4	42.73	61.46	50.31	54.91	57.12	55.86
web/SW	11.8	34.36	71.82	46.40	49.09	67.01	56.50

Table 3: Type-level single-class evaluations against predominant/multiple classes. K represents the (average) number of induced classes.

As our baselines, we adopt two previously proposed methods. We first implemented a soft clustering method for verb class induction proposed by Korhonen et al. (2003). They used the information bottleneck (IB) method for assigning probabilities of classes to each verb. Note that Korhonen et al. (2003) actually hardened the clusterings and left the evaluations of soft clusterings for their future work. For input data, we employ VALEX [13], which is a publicly-available large-scale subcategorization lexicon.⁵ By following the method of Korhonen et al. (2003), prepositional phrases (pp) are parameterized for two frequent subcategorization frames (NP and NP_PP), and the unfiltered raw frequencies of subcategorization frames are used as features to represent a verb. It is necessary to specify the number of clusters, $k$, for the IB method beforehand, and we adopt 35 and 42 clusters according to their reported high accuracies. To output multiple classes for each verb, we set a threshold, $t$, for class attribute probabilities. That is, classes that have a higher class attribute probability than the threshold are output for each verb. We report the results of the following threshold values: 0.01, 0.02, 0.05 and 0.10.

⁵ http://ilexir.co.uk/applications/valex/
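The thresholding step amounts to a simple filter over a verb's class attribute probabilities. The sketch below uses a hypothetical distribution with made-up class names and values, purely to show the effect of the threshold $t$.

```python
def classes_above_threshold(class_probs, t):
    """Return the classes whose attribute probability is higher than threshold t."""
    return sorted(c for c, p in class_probs.items() if p > t)


# Hypothetical class attribute distribution for one verb (illustrative values).
probs = {"c1": 0.62, "c2": 0.33, "c3": 0.04, "c4": 0.01}
strict = classes_above_threshold(probs, 0.05)
lenient = classes_above_threshold(probs, 0.01)
```

Lowering the threshold admits more classes per verb, which is why the number of output classes grows as $t$ decreases.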

The other baseline is LDA-frames [21]. We use the induced LDA-frames that are available on the web site.⁶ This frame data was induced from the BNC and consists of 1,200 frames and 400 semantic roles. Again, we set a threshold for frame attribute probabilities.

⁶ http://nlp.fi.muni.cz/projekty/lda-frames/

We report results using our methods with four feature combinations (slot-only (S) and slot-word pair (SW) features each used for both the frame-generation and verb-class clustering steps) for both the Gigaword and web corpora. Table 2 lists evaluation results for the baseline methods and our methods.⁷ The results of the IB baseline and our methods are obtained by averaging five runs.

⁷ Although we do not think that the classes with very small attribute probabilities are meaningful, the F$_1$ scores for lower thresholds than 0.01 converged to about 66 in the case of LDA-frames.

We can see that “web/SW-S” achieved the best performance and obtained a higher F$_1$ than the baselines by more than nine points. The “web/SW-S” setting uses the combination of slot-word pair features for clustering verb-specific frames and slot-only features for clustering across verbs. Interestingly, this result indicates that slot distributions are more effective than lexical information in slot-word pairs for inducing verb classes similar to the gold standard. This result is consistent with expectations, given a gold standard based on Levin’s verb classes, which are organized according to the syntactic behavior of verbs. The use of slot-word pairs for verb class induction generally merged too many frames into each class, apparently due to accidental word overlaps across verbs.

The verb classes induced from the web corpus achieved a higher F$_1$ than those from the Gigaword corpus. This can be attributed to the larger size of the web corpus. The employment of this kind of huge corpus is enabled by our scalable method.

4.4 Type-level Single-class Evaluations against Predominant/Multiple Classes

Since we focus on the handling of verb polysemy, predominant class induction for each verb is not our main objective. However, we wish to compare our method with previous work on the induction of a predominant (monosemous) class for each verb.

To output a single class for each verb by using our proposed method, we skip the induction of verb-specific semantic frames and instead create a single frame for each verb by merging all predicate-argument structures of the verb. Then, we apply clustering to these frames across verbs. For clustering features, we again compare two representations: slot-only features (S) and slot-word pair features (SW).

We evaluate the single-class output for each verb based on the predominant gold-standard classes, which are defined for each verb in the test set of Korhonen et al. (2003). This data contains 110 verbs and 33 classes. In the same manner as Korhonen et al. (2003), we also evaluate these single-class outputs against the gold standard with multiple classes, which we use for our multi-class evaluations as well.

As we did with the multi-class evaluations, we adopt modified purity (mPU), inverse purity (iPU) and their harmonic mean (F$_1$) as the metrics for the evaluation with predominant classes. It is not necessary to normalize these metrics when we treat verbs as monosemous, and evaluate against the predominant sense. When we evaluate against the multiple classes in the gold standard, we do normalize the inverse purity.

For baselines, we once more adopt the Nearest Neighbor (NN) and Information Bottleneck (IB) methods proposed by Korhonen et al. (2003), and LDA-frames proposed by Materna (2012). The clusterings with the NN and IB methods are obtained by using the VALEX subcategorization lexicon. To harden the clusterings of the IB method and the LDA-frames, the class with the highest probability is selected for each verb. This hardening process is exactly the same as Korhonen et al. (2003). Note that our results of the NN and IB methods are different from those reported in their paper since the data source is different.⁸

⁸ Korhonen et al. (2003) reported that the highest modified purity was 49% against predominant classes and 60% against multiple classes.
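Hardening a soft clustering reduces to taking the most probable class per verb. A minimal sketch, using a hypothetical class attribute distribution with made-up class names and values:

```python
def harden(class_probs):
    """Return the single class with the highest attribute probability."""
    return max(class_probs, key=class_probs.get)


# Hypothetical soft assignment for one verb (illustrative values).
soft_assignment = {"c1": 0.62, "c2": 0.33, "c3": 0.05}
hard_class = harden(soft_assignment)
```

Ties are resolved here by dictionary order, which is an arbitrary choice of this sketch.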

Table 3 lists accuracies of baseline methods and our methods. Our proposed method using the web corpus achieved comparable performance with the baseline methods on the predominant class evaluation and outperformed them on the multiple class evaluation. More sophisticated methods for predominant class induction, such as the method of Sun and Korhonen (2009) using selectional preferences, could produce better single-class outputs, but have difficulty in producing polysemy-aware verb classes.

From the result, we can see that the induced verb classes based on slot-only features did not achieve a higher F$_1$ than those based on slot-word pair features in many cases. This result is different from that of multi-class evaluations in Section 4.3. We speculate that slot distributions are not so different among verbs when all uses of a verb are merged into one frame, and thus their discrimination power is lower than that in the intermediate construction of semantic frames.

4.5 Token-level Multi-class Evaluations

We conduct token-level multi-class evaluations using 119 verbs, which appear 100 or more times in sections 02-21 of the SemLink WSJ corpus. These 119 verbs cover 102 VerbNet classes, and 48 of them are polysemous in the sense of being in more than one VerbNet class. Each instance of these 119 verbs in this corpus belongs to one of 102 VerbNet classes. We first add these instances to the instances from a raw corpus and apply the two-step clustering to these merged instances. Then, we compare the induced verb classes of the SemLink instances with their gold-standard VerbNet classes. We report the values of modified purity (mPU), inverse purity (iPU) and their harmonic mean (F$_1$). It is not necessary to normalize these metrics because the clustering of these instances is hard.

For clustering features, we compare two feature combinations: “S-S” and “SW-S,” which achieved high performance in the type-level multi-class evaluations (Section 4.3). The results of these methods are obtained by averaging five runs. For a baseline, we use verb-specific semantic frames without clustering across verbs (“S-NIL” and “SW-NIL”), where these frames are considered to be verb classes but not shared across verbs. Table 4 lists accuracies of these methods for the two corpora. We can see that “SW-S” achieved a higher F$_1$ than “S-S” and the baselines without verb class induction (“S-NIL” and “SW-NIL”).

Modi et al. (2012) induced semantic frames across verbs using the monosemous assumption and reported an F$_1$ of 44.7% (77.9% PU and 31.4% iPU) for the assignment of FrameNet frames to the FrameNet corpus. We also conducted the above evaluation against FrameNet frames for 75 verbs.⁹ We achieved an F$_1$ of 62.79% (66.97% mPU and 59.09% iPU) for “web/SW-S,” and an F$_1$ of 60.06% (65.58% mPU and 55.39% iPU) for “Gigaword/SW-S.” It is difficult to directly compare these results with Modi et al. (2012), but our induced verb classes seem to have higher F$_1$ accuracy.

⁹ Since FrameNet frames are not assigned to all verbs of SemLink, the number of verbs is different from the evaluations against VerbNet classes.

method	K	mPU	iPU	F$_1$
Gigaword/S-NIL	–	93.43	20.06	33.03
Gigaword/SW-NIL	–	94.45	41.07	57.25
Gigaword/S-S	512.2	75.06	45.26	56.47
Gigaword/SW-S	260.6	73.98	56.45	64.04
web/S-NIL	–	93.70	32.96	48.76
web/SW-NIL	–	94.51	44.95	60.92
web/S-S	500.0	72.25	52.48	60.79
web/SW-S	255.2	72.65	61.00	66.31

Table 4: Token-level evaluations against VerbNet classes. K represents the average number of induced classes.
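As a consistency check on Table 4, the harmonic mean can be recomputed from the mPU and iPU columns. The worked example below uses the “Gigaword/SW-S” row. Recomputing other rows may differ from the reported F$_1$ in the last digit, presumably because all three columns are averages over five runs.

```latex
\[
F_1 = \frac{2 \cdot \mathrm{mPU} \cdot \mathrm{iPU}}{\mathrm{mPU} + \mathrm{iPU}}
    = \frac{2 \times 73.98 \times 56.45}{73.98 + 56.45}
    = \frac{8352.34}{130.43}
    \approx 64.04
\]
```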

4.6 Full Experiments and Discussions

Finally, we induce verb classes from the semantic frames of 1,667 verbs that appear at least once in sections 02-21 of the WSJ corpus. Based on the best results in the above evaluations, we induced semantic frames using slot-word pair features, and then induced verb classes using slot-only features. We obtained 38,481 semantic frames and 699 verb classes from the Gigaword corpus, and 61,903 semantic frames and 840 verb classes from the web corpus. It took two days to induce verb classes from the Gigaword corpus and three days from the web corpus.

Examples of verb classes and semantic frames induced from the web corpus are shown in Table 5 and Table 6. While there are many classes with consistent meanings, such as “Class 4” and “Class 16,” some classes have mixed meanings. For instance, “Class 2” consists of the semantic frames “need:2” and “say:2.” These frames were merged because their slot distributions, each consisting of a subject and a sentential complement, are syntactically very similar. To improve the quality of verb classes, it is necessary to develop a clustering model that can consider syntactic and lexical similarity in a balanced way.

class	semantic frames
Class 1	rave:1, talk:1
Class 2	need:2, say:2
Class 3	smell:1, sound:1
Class 4	concentrate:1, focus:1
Class 5	express:2, inquire:62, voice:1
Class 6	revolve:1, snake:2, wrap:2
Class 7	hand:1, hand:3, hand:4
Class 8	depend:1, rely:1, rely:3
Class 9	collaborate:1, compete:2, work:1
Class 10	coach:3, teach:3, teach:4
Class 11	dance:1, react:1, stick:1
Class 12	advise:8, express:4, quiz:10, voice:2
Class 13	give:18, grant:6, offer:11, offer:12
Class 14	keep:14, keep:18, stay:4, stay:488
Class 15	cuff:5, fasten:2, tie:1, tie:4
Class 16	arrange:3, book:4, make:27, reserve:5
Class 17	deport:6, differ:1, fluctuate:1, vary:1
Class 18	peek:1, peek:3, peer:1, peer:7, …
Class 19	groan:1, growl:1, hiss:1, moan:1, purr:1
Class 20	inform:1, notify:2, remind:1, beware:1, …

Table 5: Examples of induced verb classes. The semantic frames need:2, say:2, inform:1 and notify:2 (underlined in the original) are shown in Table 6.

5 Conclusion

We presented a step-wise unsupervised method for inducing verb classes from instances in giga-word corpora. This method first clusters predicate-argument structures to induce verb-specific semantic frames and then clusters these semantic frames across verbs to induce verb classes. Both clustering steps are performed with exactly the same method, which is based on the Chinese Restaurant Process. The resulting semantic frames and verb classes are open to the public and can also be searched via our web interface (http://nlp.ist.i.kyoto-u.ac.jp/member/kawahara/cf/crp.en/).
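For readers unfamiliar with the Chinese Restaurant Process, the following minimal sketch draws cluster assignments from the CRP prior alone: each item joins an existing cluster with probability proportional to that cluster's size, or opens a new one with probability proportional to a concentration parameter. The function name `crp_assignments` and the parameter `alpha` are hypothetical. A full clustering model would also weigh how well an item's features fit each cluster, which is omitted here, so this is not the clustering procedure used in the experiments.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a CRP prior (alpha > 0)."""
    rng = random.Random(seed)
    sizes = []   # sizes[k] = number of items currently in cluster k
    labels = []  # labels[i] = cluster chosen for item i
    for _ in range(n_items):
        # Existing clusters are weighted by size, a new cluster by alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, sizes


labels, sizes = crp_assignments(n_items=1000, alpha=1.0)
print(len(sizes), "clusters; largest has", max(sizes), "items")
```

Because the number of clusters is not fixed in advance, the same procedure can in principle serve both steps: grouping verb uses into frames and grouping frames into classes.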

frame	slot	instance words
need:2	nsubj	you:2150273, i:7678, we:4599, …
	ccomp	$\langle$s$\rangle$:2193321
say:2	nsubj	she:1705781, he:20693, i:9422, …
	ccomp	$\langle$s$\rangle$:1829616
inform:1	nsubj	i:11100, he:10323, we:6373, …
	dobj	me:30646, you:27678, us:21642, …
	prep_of	decision:846, this:759, situation:688, …
	$\vdots$
notify:2	nsubj	we:7505, you:3439, i:1035, …
	dobj	you:18604, us:7281, them:3649, …
	prep_of	change:1540, problem:496, status:386, …
	$\vdots$

Table 6: Examples of induced semantic frames. The number following an instance word denotes its frequency and $\langle$s$\rangle$ denotes a sentential complement.

From the results, we can see that the combination of the slot-word pair features for clustering verb-specific frames and the slot-only features for clustering across verbs is the most effective and outperforms the baselines by approximately 10 points. This indicates that slot distributions are more effective than lexical information in slot-word pairs for the induction of verb classes, when Levin-style classes are used for evaluation. This is consistent with Levin’s principle of organizing verb classes according to the syntactic behavior of verbs.
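The difference between the two feature types can be illustrated on the frames “need:2” and “say:2” from Table 6. The sketch below uses only the counts visible in that table (the entries elided by “…” are left out, so slot totals are lower bounds) and compares the frames with cosine similarity. Cosine similarity is an illustrative stand-in chosen for this sketch, not the clustering model used in the experiments, and the helper names are hypothetical.

```python
from collections import Counter
from math import sqrt

# Visible counts from Table 6; "<s>" stands for a sentential complement.
FRAMES = {
    "need:2": {
        "nsubj": {"you": 2150273, "i": 7678, "we": 4599},
        "ccomp": {"<s>": 2193321},
    },
    "say:2": {
        "nsubj": {"she": 1705781, "he": 20693, "i": 9422},
        "ccomp": {"<s>": 1829616},
    },
}


def slot_only(frame):
    """Slot-only features: one count per slot, words collapsed."""
    return Counter({slot: sum(words.values()) for slot, words in frame.items()})


def slot_word(frame):
    """Slot-word pair features: one count per (slot, word) pair."""
    return Counter(
        {(slot, w): c for slot, words in frame.items() for w, c in words.items()}
    )


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


need, say = FRAMES["need:2"], FRAMES["say:2"]
print("slot-only:", round(cosine(slot_only(need), slot_only(say)), 3))
print("slot-word:", round(cosine(slot_word(need), slot_word(say)), 3))
```

With slot-only features the two frames are almost identical, since both consist of a subject and a sentential complement. With slot-word pairs the similarity drops to roughly one half, because the dominant subjects (“you” versus “she”) differ. This mirrors why the two frames end up in the same class when classes are induced from slot-only features.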

As applications of the resulting semantic frames and verb classes, we plan to integrate them into syntactic parsing, semantic role labeling and verb sense disambiguation. For instance, Kawahara and Kurohashi (2006) improved the accuracy of dependency parsing using Japanese semantic frames automatically induced from a raw corpus. It is also valuable and promising to apply the induced verb classes to NLP applications such as metaphor identification [34] and argumentative zoning [8].

Acknowledgments

This work was supported by Kyoto University John Mung Program and JST CREST. We also gratefully acknowledge the support of the National Science Foundation Grant NSF-IIS-1116782, A Bayesian Approach to Dynamic Lexical Resources for Flexible Language Processing. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

[1] D. Aldous (1985) Exchangeability and related topics. École d’Été de Probabilités de Saint-Flour XIII – 1983, pp. 1–198.
[2] D. M. Blei, A. Y. Ng and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3, pp. 993–1022.
[3] G. Boleda, S. Schulte im Walde and T. Badia (2007) Modelling polysemy in adjective classes by multi-label classification. pp. 171–180.
[4] H. T. Dang (2004) Investigations into the role of lexical semantics in word sense disambiguation. Ph.D. Thesis, University of Pennsylvania.
[5] M. de Marneffe, B. MacCartney and C. D. Manning (2006) Generating typed dependency parses from phrase structure parses. pp. 449–454.
[6] B. J. Dorr (1997) Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation 12 (4), pp. 271–322.
[7] I. Falk, C. Gardent and J. Lamirel (2012) Classifying French verbs using French and English lexical resources. pp. 854–863.
[8] Y. Guo, A. Korhonen and T. Poibeau (2011) A weakly-supervised approach to argumentative zoning of scientific documents. pp. 273–283.
[9] E. Joanis, S. Stevenson and D. James (2008) A general feature space for automatic verb classification. Natural Language Engineering 14 (3), pp. 337–367.
[10] D. Kawahara and S. Kurohashi (2006) A fully-lexicalized probabilistic model for Japanese syntactic and case structure analysis. pp. 176–183.
[11] D. Kawahara, D. W. Peterson, O. Popescu and M. Palmer (2014) Inducing example-based semantic frames from a massive amount of verb uses.
[12] K. Kipper-Schuler (2005) VerbNet: a broad-coverage, comprehensive verb lexicon. Ph.D. Thesis, University of Pennsylvania.
[13] A. Korhonen, Y. Krymolowski and T. Briscoe (2006) A large subcategorization lexicon for natural language processing applications. pp. 345–352.
[14] A. Korhonen, Y. Krymolowski and Z. Marx (2003) Clustering polysemic subcategorization frame distributions semantically. pp. 64–71.
[15] M. Lapata and C. Brew (2004) Verb class disambiguation using informative priors. Computational Linguistics 30 (1), pp. 45–73.
[16] B. Levin (1993) English verb classes and alternations: a preliminary investigation. The University of Chicago Press.
[17] J. Li and C. Brew (2007) Disambiguating Levin verbs using untagged data.
[18] J. Li and C. Brew (2008) Which are the best features for automatic verb classification. pp. 434–442.
[19] T. Lippincott, A. Korhonen and D. Ó Séaghdha (2012) Learning syntactic verb frames using graphical models. pp. 420–429.
[20] E. Loper, S. Yi and M. Palmer (2007) Combining lexical resources: mapping between PropBank and VerbNet.
[21] J. Materna (2012) LDA-frames: an unsupervised approach to generating semantic frames. pp. 376–387.
[22] J. Materna (2013) Parameter estimation for LDA-frames. pp. 482–486.
[23] Y. Miyao and J. Tsujii (2009) Supervised learning of a probabilistic lexicon of verb semantic classes. pp. 1328–1337.
[24] A. Modi, I. Titov and A. Klementiev (2012) Unsupervised induction of frame-semantic representations. pp. 1–7.
[25] R. M. Neal (2000) Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9 (2), pp. 249–265.
[26] C. Parisien and S. Stevenson (2010) Learning verb alternations in a usage-based Bayesian model.
[27] C. Parisien and S. Stevenson (2011) Generalizing between form and meaning using learned verb classes.
[28] R. Reichart, O. Abend and A. Rappoport (2010) Type level clustering evaluation: new measures and a POS induction case study. pp. 77–87.
[29] R. Reichart and A. Korhonen (2013) Improved lexical acquisition through DPP-based verb clustering. pp. 862–872.
[30] R. Sasano, D. Kawahara and S. Kurohashi (2009) The effect of corpus size on case frame acquisition for discourse analysis. pp. 521–529.
[31] S. Schulte im Walde, C. Hying, C. Scheible and H. Schmid (2008) Combining EM training and the MDL principle for an automatic verb classification incorporating selectional preferences. pp. 496–504.
[32] S. Schulte im Walde (2006) Experiments on the automatic induction of German semantic verb classes. Computational Linguistics 32 (2), pp. 159–194.
[33] L. Shi and R. Mihalcea (2005) Putting pieces together: combining FrameNet, VerbNet and WordNet for robust semantic parsing. Computational Linguistics and Intelligent Text Processing, pp. 100–111.
[34] E. Shutova, L. Sun and A. Korhonen (2010) Metaphor identification using verb and noun clustering. pp. 1002–1010.
[35] S. Springorum, S. Schulte im Walde and J. Utt (2013) Detecting polysemy in hard and soft cluster analyses of German preposition vector spaces. pp. 632–640.
[36] S. Stevenson and E. Joanis (2003) Semi-supervised verb class discovery using noisy features. pp. 71–78.
[37] R. Subba and B. Di Eugenio (2009) An effective discourse parser that uses rich linguistic information. pp. 566–574.
[38] L. Sun, A. Korhonen and Y. Krymolowski (2008) Automatic classification of English verbs using rich syntactic features. pp. 769–774.
[39] L. Sun and A. Korhonen (2009) Improving verb clustering with automatically acquired selectional preferences. pp. 638–647.
[40] L. Sun, D. McCarthy and A. Korhonen (2013) Diathesis alternation approximation for verb clustering. pp. 736–741.
[41] R. Swier and S. Stevenson (2005) Exploiting a verb lexicon in automatic semantic role labelling. pp. 883–890.
[42] Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei (2006) Hierarchical Dirichlet processes. Journal of the American Statistical Association 101 (476).
[43] N. Tishby, F. C. Pereira and W. Bialek (1999) The information bottleneck method. pp. 368–377.
[44] I. Titov and A. Klementiev (2012) A Bayesian approach to unsupervised semantic role induction. pp. 12–22.
[45] A. Vlachos, A. Korhonen and Z. Ghahramani (2009) Unsupervised and constrained Dirichlet process mixture models for verb clustering. pp. 74–82.
[46] D. Yarowsky (1993) One sense per collocation. pp. 266–271.
