We present a novel technique for semantic frame identification using distributed representations of predicates and their syntactic context; this technique leverages automatic syntactic parses and a generic set of word embeddings. Given labeled data annotated with frame-semantic parses, we learn a model that projects the set of word representations for the syntactic context around a predicate to a low dimensional representation. The latter is used for semantic frame identification; with a standard argument identification method inspired by prior work, we achieve state-of-the-art results on FrameNet-style frame-semantic analysis. Additionally, we report strong results on PropBank-style semantic role labeling in comparison to prior work.
Distributed representations of words have proved useful for a number of tasks. By providing richer representations of meaning than what can be encompassed in a discrete representation, such approaches have successfully been applied to tasks such as sentiment analysis [24], topic classification [16] or word-word similarity [20].
We present a new technique for semantic frame identification that leverages distributed word representations. According to the theory of frame semantics [12], a semantic frame represents an event or scenario, and possesses frame elements (or semantic roles) that participate in the event. Most work on frame-semantic parsing has divided the task into two major subtasks: frame identification, namely the disambiguation of a given predicate to a frame, and argument identification (or semantic role labeling), the analysis of words and phrases in the sentential context that satisfy the frame’s semantic roles [8, 7]. (There are exceptions, wherein the task has been modeled using a pipeline of three classifiers that perform frame identification, a binary stage that classifies candidate arguments, and argument identification on the filtered candidates [1, 15].) Here, we focus on the first subtask of frame identification for given predicates; we use our novel method (§3) in conjunction with a standard argument identification model (§4) to perform full frame-semantic parsing.
We present experiments on two tasks. First, we show that for frame identification on the FrameNet corpus [2, 11], we outperform the prior state of the art [7]. Moreover, for full frame-semantic parsing, with the presented frame identification technique followed by our argument identification method, we report the best results on this task to date. Second, we present results on PropBank-style semantic role labeling [22, 19, 21] that approach strong baselines and are on par with the prior state of the art [23].
Early work in frame-semantic analysis was pioneered by Gildea and Jurafsky (2002). Subsequent work in this area focused on either the FrameNet or PropBank frameworks, with research on the latter being more prevalent. Since the CoNLL 2004-2005 shared tasks [4, 5] on PropBank semantic role labeling (SRL), it has been treated as an important NLP problem. However, research has mostly focused on argument analysis, skipping the frame disambiguation step and its interaction with argument identification.
Closely related to SRL, frame-semantic parsing consists of the resolution of predicate sense into a frame, and the analysis of the frame’s arguments. Work in this area exclusively uses the FrameNet full-text annotations. Johansson and Nugues (2007) presented the best performing system at SemEval 2007 [1], and Das et al. (2010) improved upon it, later setting the current state of the art on this task [7]. We briefly discuss the FrameNet and PropBank annotation conventions here.
FrameNet The FrameNet project [2] is a lexical database that contains information about words and phrases (represented as lemmas conjoined with a coarse part-of-speech tag), termed lexical units, together with the set of semantic frames each could evoke. For each frame, there is a list of associated frame elements (or roles, henceforth), which are further distinguished as core or non-core. (Additional information, such as finer distinctions in the coreness properties of roles and the relationships between frames and between roles, is also present, but we do not leverage it in this work.) Sentences are annotated using this universal frame inventory. For example, consider the pair of sentences in Figure 1(a). Commerce_buy is a frame that can be evoked by morphological variants of the two example lexical units buy.V and sell.V. Buyer, Seller and Goods are some example roles for this frame.
PropBank The PropBank project [22] is another popular resource related to semantic role labeling. The PropBank corpus has verbs annotated with sense frames and their arguments. Like FrameNet, it also has a lexical database that stores type information about verbs, in the form of sense frames and the possible semantic roles each frame could take. There are modifier roles that are shared across verb frames, somewhat similar to the non-core roles in FrameNet. Figure 1(b) shows annotations for two verbs “bought” and “sold”, with their lemmas (akin to the lexical units in FrameNet) and their verb frames buy.01 and sell.01. Generic core role labels (of which there are seven, namely A0-A5 and AA) for the verb frames are marked in the figure. (NomBank [19] is a similar resource for nominal predicates, but we do not consider it in our experiments.) A key difference between the two annotation systems is that PropBank uses a local frame inventory, where frames are predicate-specific. Moreover, role labels, although few in number, take on a specific meaning for each verb frame. Figure 1 highlights this difference: while sell.v and buy.v are members of the same frame in FrameNet, they evoke different frames in PropBank. In spite of this difference, nearly identical statistical models can be employed for both frameworks.
Modeling In this paper, we model the frame-semantic parsing problem in two stages: frame identification and argument identification. As mentioned in §1, these correspond to a frame disambiguation stage (for example, in PropBank the lexical unit buy.V has three verb frames, and in sentential context we want to disambiguate its frame; although PropBank never formally uses the term lexical unit, we adopt its usage from the frame semantics literature), and a stage that finds the various arguments that fulfill the frame’s semantic roles within the sentence, respectively. This resembles the framework of Das et al. (2014), who, unlike this paper, focus solely on FrameNet corpora. The novelty of this paper lies in the frame identification stage (§3). Note that this two-stage approach is unusual for the PropBank corpora when compared to prior work, where the vast majority of published papers have not addressed the verb frame disambiguation problem at all, focusing only on the role labeling stage (see the overview paper of Màrquez et al. (2008) for example).
We present a model that takes word embeddings as input and learns to identify semantic frames. A word embedding is a distributed representation of meaning where each word is represented as a vector in $\mathbb{R}^n$. Such representations allow a model to share meaning between similar words, and have been used to capture semantic, syntactic and morphological content [6, 25, inter alia]. We use word embeddings to represent the syntactic context of a particular predicate instance as a vector. For example, consider the sentence “He runs the company.” The predicate runs has two syntactic dependents – a subject and a direct object (but no prepositional phrases or clausal complements). We could represent the syntactic context of runs as a vector with blocks for all the possible dependents warranted by a syntactic parser; for example, we could assume that positions $1 \ldots n$ in the vector correspond to the subject dependent, positions $n{+}1 \ldots 2n$ correspond to the clausal complement dependent, and so forth. Thus, the context is a vector in $\mathbb{R}^{nk}$ (for $k$ possible dependent positions) with the embedding of He at the subject position, the embedding of company in the direct object position and zeros everywhere else. Given input vectors of this form for our training data, we learn a matrix that maps this high dimensional and sparse representation into a lower dimensional space. Simultaneously, the model learns an embedding for all the possible labels (i.e. the frames in a given lexicon). At inference time, the predicate context is mapped to the low dimensional space, and we choose the nearest frame label as our classification. We next describe this model in detail.
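To make the block layout concrete, here is a minimal sketch (not the authors' code; the embedding values, the set of dependency labels and the toy parse are all assumptions made for illustration) of how such a context vector could be assembled:

```python
import numpy as np

EMB_DIM = 4                               # toy embedding dimension n
POSITIONS = ["nsubj", "dobj", "ccomp"]    # k possible syntactic context types

# Toy word embeddings; in practice these come from a pre-trained model.
EMBEDDINGS = {
    "He": np.array([0.1, 0.2, 0.3, 0.4]),
    "company": np.array([0.5, 0.1, 0.0, 0.2]),
}

def context_vector(dependents):
    """Map {dependency label: [words]} around a predicate to a vector in R^(n*k).

    Words sharing a label are averaged; empty blocks stay zero."""
    g = np.zeros(EMB_DIM * len(POSITIONS))
    for i, label in enumerate(POSITIONS):
        words = dependents.get(label, [])
        if words:
            block = np.mean([EMBEDDINGS[w] for w in words], axis=0)
            g[i * EMB_DIM:(i + 1) * EMB_DIM] = block
    return g

# "He runs the company.": runs has a subject (He) and a direct object (company).
print(context_vector({"nsubj": ["He"], "dobj": ["company"]}))
```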
We continue using the example sentence from §2.2: “He runs the company.” where we want to disambiguate the frame of runs in context. First, we extract the words in the syntactic context of runs; next, we concatenate their word embeddings as described in §2.2 to create an initial vector space representation. Subsequently, we learn a mapping from this initial representation into a low-dimensional space; we also learn an embedding for each possible frame label in the same low-dimensional space. The goal of learning is to make sure that the correct frame label is as close as possible to the mapped context, while competing frame labels are farther away.
Formally, let $x$ represent the actual sentence with a marked predicate, along with the associated syntactic parse tree; let our initial representation of the predicate context be $g(x)$. Suppose that the word embeddings we start with are of dimension $n$. Then $g$ is a function from a parsed sentence $x$ to $\mathbb{R}^{nk}$, where $k$ is the number of possible syntactic context types. For example, $g$ selects some important positions relative to the predicate, and reserves a block in its output space for the embedding of words found at each such position. Suppose $g$ considers only clausal complements and direct objects. Then $g(x) \in \mathbb{R}^{2n}$, and for the example sentence it has zeros in positions $1 \ldots n$ (the clausal complement block) and the embedding of the word company in positions $n{+}1 \ldots 2n$ (the direct object block).
Section 3.1 describes the context positions we use in our experiments. Let the low-dimensional space we map to be $\mathbb{R}^m$ and the learned mapping be $M$. The mapping $M$ is a linear transformation, and we learn it using the Wsabie algorithm [29]. Wsabie also learns an embedding for each frame label ($y$, henceforth). In our setting, this means that each frame corresponds to a point in $\mathbb{R}^m$. If we have $F$ possible frames, we can store those parameters in an $F \times m$ matrix, one $m$-dimensional point for each frame, which we will refer to as the linear mapping $Y$. Let the lexical unit (the lemma conjoined with a coarse POS tag) for the marked predicate be $\ell$. We denote the frames that associate with $\ell$ in the frame lexicon (the frame lexicon stores the frames, the corresponding semantic roles and the lexical units associated with each frame) and in our training corpus as $F_\ell$. Wsabie performs gradient-based updates on an objective that tries to minimize the distance between $Mg(x)$ and the embedding of the correct label $y$, while maintaining a large distance between $Mg(x)$ and the other possible labels $\bar{y}$ in the confusion set $F_\ell$. At disambiguation time, we use a simple dot product similarity as our scoring function, meaning that the model chooses a label by computing $\arg\max_y s(x,y)$ where $s(x,y) = y \cdot Mg(x)$, and the argmax iterates over the possible frames $F_\ell$ if $\ell$ was seen in the lexicon or the training data, or over all frames $F$ if it was unseen. (This disambiguation scheme is similar to the one adopted by Das et al. (2014), but they use unlemmatized words to define their confusion set.) Model learning is performed using the margin ranking loss function described in Weston et al. (2011), and in more detail in §3.2.
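As an illustration of the disambiguation step, the sketch below (with random stand-in values for $M$, the frame embeddings and $g(x)$; in the real model these are learned with Wsabie) picks the frame in the confusion set whose embedding has the largest dot product with the projected context:

```python
import numpy as np

rng = np.random.default_rng(0)
n_k, m = 12, 5                            # input dimension n*k, output dimension m
M = rng.normal(size=(m, n_k))             # linear mapping (learned in practice)
frames = ["Commerce_buy", "Self_motion", "Operating_a_system"]
Y = {f: rng.normal(size=m) for f in frames}   # frame-label embeddings

def predict_frame(g_x, candidate_frames):
    """argmax_y  y . (M g(x)) over the frames allowed for this lexical unit."""
    z = M @ g_x
    return max(candidate_frames, key=lambda f: Y[f] @ z)

g_x = rng.normal(size=n_k)                # stand-in for the context vector of "runs"
print(predict_frame(g_x, ["Self_motion", "Operating_a_system"]))
```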
Since Wsabie learns a single mapping $M$ from $g(x)$ to $\mathbb{R}^m$, parameters are shared between different words and different frames. So, for example, “He runs the company” could help the model disambiguate “He owns the company.” Moreover, since $g(x)$ relies on word embeddings rather than word identities, information is shared between words. For example, “He runs the company” could help us learn about “She runs a corporation.”
In principle $g(x)$ could be any feature function, but we performed an initial investigation of two particular variants. In both variants, our representation is a block vector where each block corresponds to a syntactic position relative to the predicate, and each block’s values correspond to the embedding of the word at that position.
Direct Dependents The first context function we considered corresponds to the examples in §3. The positions of interest are the labels of the direct dependents of the predicate, so $k$ is the number of labels that the dependency parser can produce. For example, if the label on the edge between runs and He is nsubj, we would put the embedding of He in the block corresponding to nsubj. If a label occurs multiple times, the embeddings of the words below this label are averaged.
Unfortunately, using only the direct dependents can miss a lot of useful information. For example, topicalization can place discriminating information farther from the predicate. Consider “He runs the company.” vs. “It was the company that he runs.” In the second sentence, the discriminating word, company, dominates the predicate runs. Similarly, predicates in embedded clauses may have a distant agent which cannot be captured using direct dependents. Consider “The athlete ran the marathon.” vs. “The athlete prepared himself for three months to run the marathon.” In the second example, for the predicate run, the agent The athlete is not a direct dependent, but is connected via a longer dependency path.
Dependency Paths To capture more relevant context, we developed a second context function as follows. We scanned the training data for a given task (either the PropBank or the FrameNet domain) for the dependency paths that connected the gold predicates to the gold semantic arguments. This set of dependency paths was taken as the set of possible positions in the initial vector space representation. In addition, akin to the first context function, we also added all dependency labels to the context set. Thus, for this context function, the block cardinality $k$ was the sum of the number of observed gold dependency path types and the number of dependency labels. Given a predicate in its sentential context, we therefore extract only those context words that appear in positions warranted by the above set. See Figure 3 for an illustration of this process.
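A rough sketch of this scan is shown below; the path encoding (tuples of direction and label steps) and the toy gold paths are assumptions for illustration, and only the idea of pooling observed gold path types with the plain dependency labels comes from the description above:

```python
from collections import Counter

# Hypothetical gold paths from a predicate to an argument head, encoded as
# (direction, dependency label) steps.
gold_paths = [
    (("up", "rcmod"), ("up", "nsubj")),   # e.g. "the company that he runs"
    (("up", "xcomp"), ("up", "nsubj")),   # e.g. "... prepared ... to run ..."
    (("down", "dobj"),),
]

path_counts = Counter(gold_paths)
DEP_LABELS = ["nsubj", "dobj", "ccomp"]   # direct-dependent positions
# Context positions = observed gold path types + plain dependency labels.
POSITIONS = list(path_counts) + DEP_LABELS
print(len(POSITIONS), "context block types")
```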
We performed initial experiments using context extracted from 1) direct dependents, 2) dependency paths, and 3) both. For all our experiments, setting 3), which concatenates the direct dependent and dependency path representations, always dominated the other two, so we only report results for this setting.
We model our objective function following Weston et al. (2011), using a weighted approximate-rank pairwise loss, learned with stochastic gradient descent. The mapping from $g(x)$ to the low-dimensional space is a linear transformation, so the model parameters to be learned are the matrix $M$ as well as the embedding of each possible frame label, represented as another matrix $Y$, where there are $F$ frames in total. The training objective minimizes:

$$\sum_{\langle x, y \rangle} \; \sum_{\bar{y} \in F_\ell \setminus \{y\}} L\big(\mathrm{rank}_y(x)\big)\, \max\big(0,\; \gamma + s(x, \bar{y}) - s(x, y)\big)$$

where $\langle x, y \rangle$ are the training inputs and their corresponding correct frames, $\bar{y}$ are negative frames, and $\gamma$ is the margin. Here, $\mathrm{rank}_y(x)$ is the rank of the positive frame $y$ relative to all the negative frames:

$$\mathrm{rank}_y(x) = \sum_{\bar{y}} \mathbb{I}\big[\gamma + s(x, \bar{y}) \ge s(x, y)\big]$$

and $L(\cdot)$ converts the rank to a weight. Choosing $L(\eta) = C\eta$ for any positive constant $C$ optimizes the mean rank, whereas a weighting such as $L(\eta) = \sum_{i=1}^{\eta} 1/i$ (adopted here) optimizes the top of the ranked list, as described in [26]. To train with such an objective, stochastic gradient descent is employed. For speed, the computation of $\mathrm{rank}_y(x)$ is replaced with a sampled approximation: sample negative labels until a violation is found, i.e. $\gamma + s(x, \bar{y}) > s(x, y)$, and then approximate the rank with $\lfloor (F-1)/N \rfloor$, where $N$ is the number of labels sampled before the violation; see Weston et al. (2011) for more details on this procedure. For the choices of the stochastic gradient learning rate, margin ($\gamma$) and dimensionality ($m$), please refer to §5.4-§5.5.
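The following is a compact sketch of one such sampled-rank update (plain numpy and SGD with toy dimensions; the actual hyperparameters and implementation details are not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n_k, m, n_frames = 12, 5, 50
M = rng.normal(scale=0.1, size=(m, n_k))       # linear mapping to R^m
Y = rng.normal(scale=0.1, size=(n_frames, m))  # one embedding per frame label
gamma, lr = 0.1, 0.01

def weight(rank):
    # L(rank): weighting that emphasizes the top of the ranked list.
    return sum(1.0 / i for i in range(1, rank + 1))

def warp_step(M, Y, g_x, pos, negatives):
    """One sampled WARP update for a single training example (in place)."""
    z = M @ g_x
    s_pos = Y[pos] @ z
    for trials, neg in enumerate(rng.permutation(negatives), start=1):
        if gamma + Y[neg] @ z > s_pos:               # margin violation found
            rank = max(1, len(negatives) // trials)  # sampled rank estimate
            w = lr * weight(rank)
            delta = Y[pos] - Y[neg]
            M += w * np.outer(delta, g_x)            # update the projection
            Y[pos] += w * z                          # pull correct frame closer
            Y[neg] -= w * z                          # push violating frame away
            return

warp_step(M, Y, rng.normal(size=n_k), pos=3,
          negatives=[f for f in range(n_frames) if f != 3])
```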
Note that an alternative approach could learn only the matrix $M$, and then use a $k$-nearest neighbor classifier in $\mathbb{R}^m$, as in Weinberger and Saul (2009). The advantage of learning an embedding for the frame labels is that at inference time we need to consider only the set of labels for classification rather than all training examples. Additionally, since we use a frame lexicon that gives us the possible frames for a given predicate, we usually only consider a handful of candidate labels. If we used all training examples for a given predicate to find a nearest-neighbor match at inference time, we would have to consider many more candidates, making the process very slow.
Table 1: Features used for argument identification; $a$ denotes a candidate argument span.

starting word of $a$ | POS of the starting word of $a$
ending word of $a$ | POS of the ending word of $a$
head word of $a$ | POS of the head word of $a$
bag of words in $a$ | bag of POS tags in $a$
a bias feature | voice of the predicate use
word cluster of $a$’s head |
word cluster of $a$’s head conjoined with word cluster of the predicate |
dependency path between $a$’s head and the predicate |
the set of dependency labels of the predicate’s children |
dependency path conjoined with the POS tag of $a$’s head |
dependency path conjoined with the word cluster of $a$’s head |
position of $a$ with respect to the predicate (before, after, overlap or identical) |
whether the subject of the predicate is missing (missingsubj) |
missingsubj, conjoined with the dependency path |
missingsubj, conjoined with the dependency path from the verb dominating the predicate to $a$’s head |
Here, we briefly describe the argument identification model used in our frame-semantic parsing experiments after frame identification. Given $x$, the sentence with a marked predicate, the argument identification model assumes that the predicate frame $y$ has been disambiguated. From a frame lexicon, we look up the set of semantic roles $\mathcal{R}_y$ that associate with $y$. This set also contains the null role $\varnothing$. From $x$, a rule-based candidate argument extraction algorithm extracts a set of spans $\mathcal{A}$ that could potentially serve as the overt (by overtness, we mean the non-null instantiation of a semantic role in a frame-semantic parse) arguments for $y$ (see §5.4-§5.5 for the details of the candidate argument extraction algorithms).
Learning Given training data of the form $\{\langle x^{(i)}, y^{(i)}, \mathcal{A}^{(i)} \rangle\}_{i=1}^{N}$, where

$$\mathcal{A}^{(i)} = \big\{ \langle r, a \rangle : r \in \mathcal{R}_{y^{(i)}},\; a \in \mathcal{A} \cup \{\varnothing\} \big\} \qquad (1)$$

is a set of tuples that associates each role $r$ in $\mathcal{R}_{y^{(i)}}$ with a span $a$ according to the gold data (note that this mapping associates spans with the null role $\varnothing$ as well), we optimize the following log-likelihood to train our model:

$$\max_{\theta} \sum_{i=1}^{N} \; \sum_{\langle r, a \rangle \in \mathcal{A}^{(i)}} \log p_{\theta}\big(a \mid r, x^{(i)}, y^{(i)}\big)$$

where $p_{\theta}$ is a log-linear model normalized over the set of candidate spans (including $\varnothing$), with features described in Table 1. We apply $\ell_2$ regularization and use L-BFGS [18] for training.
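To make the form of the local model concrete, here is a small sketch (with toy features and spans rather than the feature set of Table 1) of a per-role softmax over candidate spans, including the null span:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def role_distribution(theta, feature_fn, role, spans):
    """p(span | role, x, y), normalized over the candidate spans (incl. null)."""
    scores = np.array([theta @ feature_fn(role, s) for s in spans])
    return softmax(scores)

# Toy example: 3 binary features and hypothetical spans for the role Buyer.
theta = np.array([1.2, -0.3, 0.5])
spans = [None, (0, 0), (3, 4)]            # None stands for the null span
feat = lambda role, s: np.array([s is None, s == (0, 0), s == (3, 4)], float)
print(role_distribution(theta, feat, "Buyer", spans))
```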
Inference Although our learning mechanism uses a local log-linear model, we perform inference globally on a per-frame basis by applying hard structural constraints. Following Das et al. (2014) and Punyakanok et al. (2008) we use the log-probability of the local classifiers as a score in an integer linear program (ILP) to assign roles subject to hard constraints described in §5.4 and §5.5. We use an off-the-shelf ILP solver for inference.
In this section, we present our experiments and the results achieved. We evaluate our novel frame identification approach both in isolation and in conjunction with argument identification, resulting in full frame-semantic structures; before presenting our model’s performance, we first describe the datasets, baselines and experimental setup.
We evaluate our models on both FrameNet- and PropBank-style structures. For FrameNet, we use the full-text annotations in the FrameNet 1.5 release (see https://framenet.icsi.berkeley.edu), which was also used by Das et al. (2014, §3.2). We used the same test set as Das et al., containing 23 documents with 4,458 predicates. Of the remaining 55 documents, 16 documents were randomly chosen for development (these documents are listed in Appendix A).
For experiments with PropBank, we used the OntoNotes corpus [14], version 4.0, and only made use of the Wall Street Journal documents; we used sections 2-21 for training, section 24 for development and section 23 for testing. This resembles the setup used by Punyakanok et al. (2008). All the verb frame files in OntoNotes were used for creating our frame lexicon.
For comparison, we implemented a set of baseline models, with varying feature configurations. The baselines use a log-linear model that models the following probability at training time:
$$p(y \mid x, \ell) = \frac{\exp\{\psi \cdot f(y, x, \ell)\}}{\sum_{\bar{y} \in F_\ell} \exp\{\psi \cdot f(\bar{y}, x, \ell)\}} \qquad (2)$$

where $f(y, x, \ell)$ is a feature function and $\psi$ is the vector of model parameters.
At test time, this model chooses the best frame as $\arg\max_{y} p(y \mid x, \ell)$, where the argmax iterates over the possible frames $F_\ell$ if $\ell$ was seen in the lexicon or the training data, or over all frames $F$ if it was unseen, like the disambiguation scheme of §3. We train this model by maximizing regularized log-likelihood, using L-BFGS; the regularization constant was set to 0.1 in all experiments.
For comparison with our model from §3, which we call Wsabie Embedding, we implemented two baselines with the log-linear model. Both baselines use features very similar to the input representations described in §3.1. The first computes the direct dependents and dependency paths as described in §3.1, but conjoins them with the word identity rather than a word embedding. Additionally, this model uses the un-conjoined words as backoff features. This is a standard NLP approach to the frame identification problem, but it is surprisingly competitive with the state of the art. We call this baseline Log-Linear Words. The second baseline tries to decouple the Wsabie training from the embedding input, and trains a log-linear model using the embeddings. It therefore has the same input representation as Wsabie Embedding but uses a log-linear model instead of Wsabie. We call this model Log-Linear Embedding.
Semafor Lexicon | Full Lexicon | ||||||
---|---|---|---|---|---|---|---|
Development Data | Model | All | Ambiguous | Rare | All | Ambiguous | Rare |
Log-Linear Words | 96.21 | 90.41 | 95.75 | 96.37 | 90.41 | 96.07 | |
Log-Linear Embedding | 96.06 | 90.56 | 95.38 | 96.19 | 90.49 | 95.70 | |
Wsabie Embedding (§3) | 96.90 | 92.73 | 96.44 | 96.99 | 93.12 | 96.39 |
Semafor Lexicon | Full Lexicon | |||||||
Model | All | Ambiguous | Rare | Unseen | All | Ambiguous | Rare | |
Test Data | Das et al. (2014) supervised | 82.97 | 69.27 | 80.97 | 23.08 | |||
Das et al. (2014) best | 83.60 | 69.19 | 82.31 | 42.67 | ||||
Log-Linear Words | 84.71 | 70.97 | 81.70 | 27.27 | 87.44 | 70.97 | 87.10 | |
Log-Linear Embedding | 83.42 | 68.70 | 80.95 | 27.97 | 86.20 | 68.70 | 86.03 | |
Wsabie Embedding (§3) | 86.58 | 73.67 | 85.04 | 44.76 | 88.73 | 73.67 | 89.38 |
Semafor Lexicon | Full Lexicon | ||||||
Model | Precision | Recall | F1 | Precision | Recall | F1 |
Development Data | Log-Linear Words | 89.43 | 75.98 | 82.16 | 89.41 | 76.05 | 82.19 |
Wsabie Embedding (§3) | 89.89 | 76.40 | 82.59 | 89.94 | 76.27 | 82.54 | |
Test Data | Das et al. supervised | 67.81 | 60.68 | 64.05 |
Das et al. best | 68.33 | 61.14 | 64.54 | ||||
Log-Linear Words | 71.16 | 63.56 | 67.15 | 73.35 | 65.27 | 69.08 | |
Wsabie Embedding (§3) | 72.79 | 64.95 | 68.64 | 74.44 | 66.17 | 70.06 |
We process our PropBank and FrameNet training, development and test corpora with a shift-reduce dependency parser that uses the Stanford conventions [9] and an arc-eager transition system with a beam size of 8; the parser and its features are described by Zhang and Nivre (2011). Before parsing, the data is tagged with a POS tagger trained with a conditional random field [17] with the following emission features: the word, its word cluster, word suffixes of length 1, 2 and 3, capitalization, and whether it contains a hyphen, digit or punctuation. Beyond the bias transition feature, we have two cluster features for the left and right words in the transition. For the cluster features, we use Brown clusters learned using the algorithm of Uszkoreit and Brants (2008) on a large English newswire corpus. We use the same word clusters for the argument identification features in Table 1.
We learn the initial embedding representations for our frame identification model (§3) using a deep neural language model similar to the one proposed by Bengio et al. (2003). We use 3 hidden layers each with 1024 neurons and learn a 128-dimensional embedding from a large corpus containing over 100 billion tokens. In order to speed up learning, we use an unnormalized output layer and a hinge-loss objective. The objective tries to ensure that the correct word scores higher than a random incorrect word, and we train with minibatch stochastic gradient descent.
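As a rough illustration only (a single tanh hidden layer in numpy, far smaller than the model described above, and with a single sampled negative word), the ranking-style hinge objective can be sketched as follows:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, dim, hidden = 1000, 16, 32
E = rng.normal(scale=0.1, size=(vocab, dim))         # word embeddings being learned
W1 = rng.normal(scale=0.1, size=(hidden, 3 * dim))   # 3-word context window
W2 = rng.normal(scale=0.1, size=(vocab, hidden))     # unnormalized output layer

def score(context_ids, word_id):
    h = np.tanh(W1 @ np.concatenate([E[i] for i in context_ids]))
    return W2[word_id] @ h

def hinge_loss(context_ids, correct_id):
    wrong_id = rng.integers(vocab)                   # a random incorrect word
    return max(0.0, 1.0 - score(context_ids, correct_id) + score(context_ids, wrong_id))

print(hinge_loss([1, 2, 3], correct_id=4))
```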
Hyperparameters For our frame identification model with embeddings, we select the Wsabie hyperparameters (the stochastic gradient learning rate, the margin $\gamma$ and the dimensionality $m$ of the final vector space) on the development data, so as to maximize the frame identification accuracy of ambiguous lexical units; by ambiguous, we mean lexical units that appear in the training data or the lexicon with more than one semantic frame. The best-performing values were then used to analyze the test data.
Argument Candidates The candidate argument extraction method used for the FrameNet data (as mentioned in §4) was adapted from the algorithm of Xue and Palmer (2004) applied to dependency trees. Since the original algorithm was designed for verbs, we added a few extra rules to handle non-verbal predicates: 1) the predicate itself as a candidate argument; 2) the span ranging from the sentence position to the right of the predicate to the rightmost index of the subtree headed by the predicate’s head, which helped capture cases like “a few months” (where few is the predicate and months is the argument); and 3) the span ranging from the leftmost index of the subtree headed by the predicate’s head to the position immediately before the predicate, for cases like “your gift to Goodwill” (where to is the predicate and your gift is the argument). (Note that Das et al. (2014) describe the state of the art in FrameNet-based analysis, but their argument identification strategy considered all possible dependency subtrees in a parse, resulting in a much larger search space.)
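A schematic sketch of these extra rules is given below; the flat (left, right) representation of the subtree headed by the predicate's head is a simplification of the dependency trees the system actually operates on:

```python
def extra_candidates(pred_idx, head_subtree_span):
    """head_subtree_span = (left, right) token indices (inclusive) of the
    subtree headed by the predicate's syntactic head."""
    left, right = head_subtree_span
    spans = [(pred_idx, pred_idx)]              # rule 1: the predicate itself
    if pred_idx + 1 <= right:
        spans.append((pred_idx + 1, right))     # rule 2: e.g. "months" in "a few months"
    if left <= pred_idx - 1:
        spans.append((left, pred_idx - 1))      # rule 3: e.g. "your gift" in "your gift to Goodwill"
    return spans

# "your gift to Goodwill": predicate "to" at index 2, head subtree covers tokens 0..3.
print(extra_candidates(2, (0, 3)))
```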
Frame Lexicon In our experimental setup, we scanned the XML files in the “frames” directory of the FrameNet 1.5 release, which lists all the frames, the corresponding roles and the associated lexical units, and created a frame lexicon to be used in our frame and argument identification models. We noted that this renders every lexical unit as seen; in other words, at frame disambiguation time on our test set, for all instances, we only had to score the frames in $F_\ell$ for a predicate with lexical unit $\ell$ (see §3 and §5.2). We call this setup Full Lexicon. While comparing with the prior state of the art on the same corpus, we noted that Das et al. (2014) found several unseen predicates at test time. (Instead of using the frame files, Das et al. built a frame lexicon from FrameNet’s exemplars and the training corpus.) For fair comparison, we took the lexical units for the predicates that Das et al. considered as seen, and constructed a lexicon with only those; training instances, if any, for the unseen predicates under Das et al.’s setup were thrown out as well. We call this setup Semafor Lexicon. (We obtained Das et al.’s seen predicates from the authors.) We also experimented on the set of unseen instances used by Das et al.
ILP constraints For FrameNet, we used three ILP constraints during argument identification (§4): 1) each span could have only one role; 2) each core role could be present only once; and 3) all overt arguments had to be non-overlapping.
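As an illustration, the per-frame ILP can be sketched with the open-source PuLP package (the paper only states that an off-the-shelf solver was used, so this choice, along with the roles, spans and scores below, is an assumption). Binary variables select a span, or the null span, for each role, maximizing the summed local log-probabilities under the constraints above; the second constraint is implicit here because each role is assigned exactly once by construction:

```python
import pulp

roles = ["Buyer", "Goods"]
spans = ["NULL", "(0,1)", "(3,5)", "(4,5)"]          # hypothetical candidates
overlaps = {("(3,5)", "(4,5)")}                      # pairs of overlapping spans
log_prob = {(r, a): -1.0 for r in roles for a in spans}  # stand-in local scores
log_prob[("Buyer", "(0,1)")] = -0.1
log_prob[("Goods", "(3,5)")] = -0.2

prob = pulp.LpProblem("argument_id", pulp.LpMaximize)
z = pulp.LpVariable.dicts("z", (roles, spans), cat="Binary")

prob += pulp.lpSum(log_prob[r, a] * z[r][a] for r in roles for a in spans)
for r in roles:                        # each role picks exactly one span (possibly NULL)
    prob += pulp.lpSum(z[r][a] for a in spans) == 1
for a in spans:                        # constraint 1: each overt span fills at most one role
    if a != "NULL":
        prob += pulp.lpSum(z[r][a] for r in roles) <= 1
for a, b in overlaps:                  # constraint 3: overt arguments must not overlap
    prob += pulp.lpSum(z[r][a] + z[r][b] for r in roles) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({r: next(a for a in spans if z[r][a].value() == 1) for r in roles})
```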
Hyperparameters As in §5.4, we performed a hyperparameter sweep over the same space, again selecting the learning rate, margin $\gamma$ and dimensionality $m$ on the development data; ambiguous lexical units were used for this selection process.
Argument Candidates For PropBank, we use the algorithm of Xue and Palmer (2004) applied to dependency trees.
Frame Lexicon For the PropBank experiments, we scanned the frame files for propositions in OntoNotes 4.0 and stored the possible core roles for each verb frame. The lexical units were simply the verbs associated with the verb frames. There were no unseen verbs at test time.
ILP constraints We used the constraints of Punyakanok et al. (2008).
Dev data

Model | All | Ambiguous | Rare |
---|---|---|---|
Log-Linear Words | 94.21 | 90.54 | 93.33 |
Log-Linear Embedding | 93.81 | 89.86 | 93.73 |
Wsabie Embedding (§3) | 94.79 | 91.52 | 92.55 |

Test data

Model | All | Ambiguous | Rare |
---|---|---|---|
Log-Linear Words | 94.74 | 92.07 | 91.32 |
Log-Linear Embedding | 94.04 | 90.95 | 90.97 |
Wsabie Embedding (§3) | 94.56 | 91.82 | 90.62 |
Dev data

Model | P | R | F1 |
---|---|---|---|
Log-Linear Words | 80.02 | 75.58 | 77.74 |
Wsabie Embedding (§3) | 80.06 | 75.74 | 77.84 |

Test data

Model | P | R | F1 |
---|---|---|---|
Log-Linear Words | 81.55 | 77.83 | 79.65 |
Wsabie Embedding (§3) | 81.32 | 77.97 | 79.61 |
Dev data

Model | P | R | F1 |
---|---|---|---|
Log-Linear Words | 77.29 | 71.50 | 74.28 |
Wsabie Embedding (§3) | 77.13 | 71.32 | 74.11 |

Test data

Model | P | R | F1 |
---|---|---|---|
Log-Linear Words | 79.47 | 75.11 | 77.23 |
Wsabie Embedding (§3) | 79.36 | 75.04 | 77.14 |
Punyakanok et al. Collins | 75.92 | 71.45 | 73.62 |
Punyakanok et al. Charniak | 77.09 | 75.51 | 76.29 |
Punyakanok et al. Combined | 80.53 | 76.94 | 78.69 |
Table 2 presents accuracy results on frame identification. (We do not report the partial frame accuracy that has been reported by prior work.) We present results on all predicates, on ambiguous predicates seen in the lexicon or the training data, and on rare ambiguous predicates that appear only a few times in the training data. The Wsabie Embedding model from §3 performs significantly better than the Log-Linear Words baseline, while Log-Linear Embedding underperforms in every metric. For the Semafor Lexicon setup, we also compare with the state of the art from Das et al. (2014), who used a semi-supervised learning method to improve upon a supervised latent-variable log-linear model. For unseen predicates from the Das et al. system, we perform better as well. Finally, for the Full Lexicon setting, the absolute accuracy numbers are even better for our best model. Table 3 presents results on the full frame-semantic parsing task (measured by a reimplementation of the SemEval 2007 shared task evaluation script) when our argument identification model (§4) is used after frame identification. We notice similar trends as in Table 2, and our results outperform the previously published best results, setting a new state of the art.
Table 4 shows frame identification results on the PropBank data. On the development set, our best model achieves the highest accuracy on all predicates and on ambiguous predicates, but performs worse on rare ambiguous predicates. On the test set, the Log-Linear Words baseline performs best by a very narrow margin. See §6 for a discussion.
Table 5 presents results where we measure precision, recall and F1 for frames and arguments together; this strict metric penalizes arguments for mismatched frames, as in Table 3. We see the same trend as in Table 4. Finally, Table 6 presents SRL results that measure argument performance only, irrespective of the frame; we use the evaluation script from CoNLL 2005 [5]. We note that with a better frame identification model, our performance on SRL improves in general. Here, too, the embedding model barely misses the performance of the best baseline, but we are on par with, and sometimes better than, the single-parser setting of a state-of-the-art SRL system [23]. (The last row of Table 6 refers to a system which used the combination of two syntactic parsers as input.)
For FrameNet, the Wsabie Embedding model we propose strongly outperforms the baselines on all metrics, and sets a new state of the art. We believe that the Wsabie Embedding model performs better than the Log-Linear Embedding baseline (which uses the same input representation) because the former setting allows examples with different labels and confusion sets to share information: all labels live in the same label space, and a single projection matrix is shared across the examples to map the input features to this space. Consequently, the Wsabie Embedding model can share more information between different examples in the training data than the Log-Linear Embedding model. Since the Log-Linear Words model always performs better than the Log-Linear Embedding model, we conclude that the primary benefit does not come from the input embedding representation. (One could imagine training a Wsabie model with word features, but we did not perform this experiment.)
On the PropBank data, we see that the Log-Linear Words baseline has roughly the same performance as our model on most metrics: slightly better on the test data and slightly worse on the development data. This can be partially explained by the significantly larger training set size for PropBank, which makes features based on words more useful. Another important distinction between PropBank and FrameNet is that the latter shares frames between multiple lexical units. The effect of this is clearly observable from the “Rare” column in Table 4. Wsabie Embedding performs poorly in this setting while Log-Linear Embedding performs well. Part of the explanation has to do with the specifics of Wsabie training. Recall that the Wsabie Embedding model needs to estimate the label location in $\mathbb{R}^m$ for each frame. In other words, it must estimate 512 parameters based on at most 10 training examples. However, since the input representation is shared across all frames, every other training example from all the lexical units affects the optimal estimate, since they all modify the joint parameter matrix $M$. By contrast, in the log-linear models each label has its own set of parameters, and they interact only via the normalization constant. The Log-Linear Words model does not have this entanglement, but cannot share information between words. For PropBank, these drawbacks and benefits balance out and we see similar performance for Log-Linear Words and Log-Linear Embedding. For FrameNet, estimating the label embedding is not as much of a problem because even if a lexical unit is rare, the potential frames can be frequent. For example, we might have seen the Sending frame many times, even though telex.V is a rare lexical unit.
In comparison to prior work on FrameNet, even our baseline models outperform the previous state of the art. A particularly interesting comparison is between our Log-Linear Words baseline and the supervised model of Das et al. (2014). They also use a log-linear model, but they incorporate a latent variable that uses WordNet [10] to get lexical-semantic relationships and smooths over frames for ambiguous lexical units. It is possible that this reduces the model’s power and causes it to over-generalize. Another difference is that when training the log-linear model, they normalize over all frames, while we normalize over the allowed frames for the current lexical unit. This would tend to encourage their model to expend more of its modeling power to rule out possibilities that will be pruned out at test time.
We have presented a simple model that outperforms the prior state of the art on FrameNet-style frame-semantic parsing, and performs on par with one of the previous best single-parser systems on PropBank SRL. Unlike Das et al. (2014), our model does not rely on heuristics to construct a similarity graph and leverage WordNet; hence, in principle, it is generalizable to varying domains and to other languages. Finally, we presented results on PropBank-style semantic role labeling with a system that included the task of automatic verb frame identification, in line with the FrameNet literature; we believe that such a system produces more interpretable output, from the perspective of both human understanding and downstream applications, than pipelines that are oblivious to the verb frame and focus only on argument analysis.
We thank Emily Pitler for comments on an early draft, and the anonymous reviewers for their valuable feedback.
Number | Filename |
---|---|
dev-1 | LUCorpus-v0.3__20000420_xin_eng-NEW.xml |
dev-2 | NTI__SouthAfrica_Introduction.xml |
dev-3 | LUCorpus-v0.3__CNN_AARONBROWN_ENG_20051101_215800.partial-NEW.xml |
dev-4 | LUCorpus-v0.3__AFGP-2002-600045-Trans.xml |
dev-5 | PropBank__TicketSplitting.xml |
dev-6 | Miscellaneous__Hijack.xml |
dev-7 | LUCorpus-v0.3__artb_004_A1_E1_NEW.xml |
dev-8 | NTI__WMDNews_042106.xml |
dev-9 | C-4__C-4Text.xml |
dev-10 | ANC__EntrepreneurAsMadonna.xml |
dev-11 | NTI__LibyaCountry1.xml |
dev-12 | NTI__NorthKorea_NuclearOverview.xml |
dev-13 | LUCorpus-v0.3__20000424_nyt-NEW.xml |
dev-14 | NTI__WMDNews_062606.xml |
dev-15 | ANC__110CYL070.xml |
dev-16 | LUCorpus-v0.3__CNN_ENG_20030614_173123.4-NEW-1.xml |