Much of the recent work on dependency parsing has been focused on solving inherent combinatorial problems associated with rich scoring functions. In contrast, we demonstrate that highly expressive scoring functions can be used with substantially simpler inference procedures. Specifically, we introduce a sampling-based parser that can easily handle arbitrary global features. Inspired by SampleRank, we learn to take guided stochastic steps towards a high scoring parse. We introduce two samplers for traversing the space of trees, Gibbs and Metropolis-Hastings with Random Walk. The model outperforms state-of-the-art parsers when evaluated on 14 languages of non-projective CoNLL datasets. Our sampling-based approach naturally extends to joint prediction scenarios, such as joint parsing and POS correction. The resulting method outperforms the best reported results on the CATiB dataset, approaching the performance of parsing with gold tags.¹

¹The source code for the work is available at http://groups.csail.mit.edu/rbg/code/global/acl2014.
Dependency parsing is commonly cast as a maximization problem over a parameterized scoring function. In this view, the use of more expressive scoring functions leads to more challenging combinatorial problems of finding the maximizing parse. Much of the recent work on parsing has therefore focused on improving methods for solving these combinatorial inference problems. Indeed, state-of-the-art results have been obtained by adapting powerful tools from optimization [16, 17, 27]. We depart from this view and instead focus on using highly expressive scoring functions with substantially simpler inference procedures. The key ingredient in our approach is how learning is coupled with inference. Our combination outperforms state-of-the-art parsers, and remains comparable to them even when we adopt their scoring functions.
Rich scoring functions have been used for some time. They first appeared in the context of reranking [6], where a simple parser is used to generate a candidate list which is then reranked according to the scoring function. Because the number of alternatives is small, the scoring function could in principle involve arbitrary (global) features of parse trees. The power of this methodology is nevertheless limited by the initial set of alternatives from the simpler parser. Indeed, the set may already omit the gold parse. We dispense with the notion of a candidate set and seek to exploit the scoring function more directly.
In this paper, we introduce a sampling-based parser that places few or no constraints on the scoring function. Starting with an initial candidate tree, our inference procedure climbs the scoring function in small (cheap) stochastic steps towards a high scoring parse. The proposal distribution over the moves is derived from the scoring function itself. Because the steps are small, the complexity of the scoring function has limited impact on the computational cost of the procedure. We explore two alternative proposal distributions. Our first strategy is akin to Gibbs sampling and samples a new head for each word in the sentence, modifying one arc at a time. The second strategy relies on a provably correct sampler for first-order scores [33], and uses it within a Metropolis-Hastings algorithm for general scoring functions. It turns out that the latter optimizes the score more efficiently than the former.
Because the inference procedure is so simple, it is important that the parameters of the scoring function are chosen so that climbing the scoring function in small steps actually leads to good parses. One way to achieve this is to make sure that improvements in the scoring function are correlated with improvements in the quality of the parse. This approach was suggested in the SampleRank framework [32] for training structured prediction models. The method was originally developed for a sequence labeling task with local features, and was shown to be more effective than state-of-the-art alternatives. Here we apply SampleRank to parsing, introducing several modifications such as the proposal distributions mentioned earlier.
The benefits of sampling-based learning go beyond stand-alone parsing. For instance, we can use the framework to correct preprocessing mistakes in features such as part-of-speech (POS) tags. In this case, we combine the scoring function for trees with a stand-alone tagging model. When proposing a small move, i.e., sampling the head of a word, we can also jointly sample its POS tag from a set of alternatives provided by the tagger. As a result, the selected tag is influenced by a broad syntactic context above and beyond the initial tagging model and is directly optimized to improve parsing performance. Our joint parsing-tagging model provides an alternative to the widely adopted pipeline setup.
We evaluate our method on benchmark multilingual dependency corpora. Our method outperforms the Turbo parser across 14 languages on average by 0.5%. On four languages, we top the best published results. Our method provides a more effective mechanism for handling global features than reranking, outperforming it by 1.3%. In terms of joint parsing and tagging on the CATiB dataset, we nearly bridge (88.38%) the gap between independently predicted (86.95%) and gold tags (88.45%). This is better than the best published results in the 2013 SPMRL shared task [28], including parser ensembles.
Earlier works on dependency parsing focused on inference with tractable scoring functions. For instance, a scoring function that operates over each single dependency can be optimized using the maximum spanning tree algorithm [21]. It was soon realized that using higher order features could be beneficial, even at the cost of using approximate inference and sacrificing optimality. The first successful approach in this arena was reranking [6, 5] on constituency parsing. Reranking can be combined with an arbitrary scoring function, and thus can easily incorporate global features over the entire parse tree. Its main disadvantage is that the output parse can only be one of the few parses passed to the reranker.
Recent work has focused on more powerful inference mechanisms that consider the full search space [34, 27, 14, 12]. For instance, Nakagawa (2007) deals with tractability issues by using sampling to approximate marginals. Another example is the dual decomposition (DD) framework [14, 17]. The idea in DD is to decompose the hard maximization problem into smaller parts that can be efficiently maximized and enforce agreement among these via Lagrange multipliers. The method is essentially equivalent to linear programming relaxation approaches [19, 29], and also similar in spirit to ILP approaches [26].
A natural approach to approximate global inference is via search. For instance, a transition-based parsing system [36] incrementally constructs a parsing structure using greedy beam-search. Other approaches operate over full trees and generate a sequence of candidates that successively increase the score [8, 15, 32]. Our work builds on one such approach — SampleRank [32], a sampling-based learning algorithm. In SampleRank, the parameters are adjusted so as to guide the sequence of candidates closer to the target structure along the search path. The method has been successfully used in sequence labeling and machine translation [11]. In this paper, we demonstrate how to adapt the method for parsing with rich scoring functions.
In this section, we introduce our novel sampling-based dependency parser which can incorporate arbitrary global features. We begin with the notation before addressing the decoding and learning algorithms. Finally, we extend our model to a joint parsing and POS correction task.
We denote sentences by $x$ and the corresponding dependency trees by $y \in \mathcal{Y}(x)$. Here $\mathcal{Y}(x)$ is the set of valid (projective or non-projective) dependency trees for sentence $x$. We use $x_j$ to refer to the $j$-th word of sentence $x$, and $h_j$ to refer to the head word of $x_j$. A training set of size $N$ is given as a set of pairs $\{(x^{(i)}, \hat{y}^{(i)})\}_{i=1}^{N}$, where $\hat{y}^{(i)}$ is the ground truth parse for sentence $x^{(i)}$.
We parameterize the scoring function as
$$s_\theta(x, y) = \theta \cdot f(x, y) \qquad (1)$$
where $f(x, y)$ is the feature vector associated with tree $y$ for sentence $x$. We do not make any assumptions about how the feature function decomposes. In contrast, most state-of-the-art parsers operate under the assumption that the feature function decomposes into a sum of simpler terms. For example, in the second-order MST parser [23], all the feature terms involve arcs or consecutive siblings. Similarly, parsers based on dual decomposition [17, 14] assume that $s_\theta(x, y)$ decomposes into a sum of terms where each term can be maximized over $y$ efficiently.
The decoding problem consists of finding a valid dependency tree $y \in \mathcal{Y}(x)$ that maximizes the score $s_\theta(x, y)$ with parameters $\theta$. For scoring functions that extend beyond first-order arc preferences, finding the maximizing non-projective tree is known to be NP-hard [23]. We find a high scoring tree through sampling, and (later) learn the parameters so as to further guide this process.
Our sampler generates a sequence of dependency structures so as to approximate independent samples from
$$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\!\left(\frac{s_\theta(x, y)}{T}\right) \qquad (2)$$
The temperature parameter $T$ controls how concentrated the samples are around the maximum of $s_\theta(x, y)$ (e.g., see Geman and Geman (1984)). Sampling from the target distribution $p_\theta$ is typically as hard as (or harder than) maximizing $s_\theta$. We therefore follow a Metropolis-Hastings sampling algorithm (e.g., see Andrieu et al. (2003)) and explore different alternative proposal distributions $q(y' \mid y, x)$. The proposal distribution governs the small steps that are taken in generating a sequence of structures. The target distribution folds into the procedure by defining the probability that we will accept the proposed move. The general structure of our sampling algorithm is given in Figure 1.
Figure 1: The general structure of our sampling algorithm.

Inputs: θ, x, T₀ (initial temperature), c (temperature update rate), proposal distribution q. Outputs: y
    Set y to some random tree; T ← T₀
    repeat
        sample a proposal y′ ∼ q(· | y, x, θ)
        α ← min{1, [p_θ(y′ | x) q(y | y′, x, θ)] / [p_θ(y | x) q(y′ | y, x, θ)]}
        sample Bernoulli variable z with P(z = 1) = α
        if z = 1 then y ← y′
        T ← c · T
    until convergence
    return y
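To make the accept/reject loop in Figure 1 concrete, here is a minimal Python sketch of an annealed Metropolis-Hastings search. The functions score, propose, and proposal_logprob are hypothetical placeholders for $s_\theta$, the proposal $q$, and its log-density; this is an illustration, not the released implementation.

```python
import math
import random

def mh_anneal(x, y0, score, propose, proposal_logprob, T0=1.0, c=0.99, n_steps=1000):
    """Search for a high-scoring tree by annealed Metropolis-Hastings sampling.

    score(x, y)            -> s_theta(x, y)
    propose(y)             -> candidate tree y' drawn from q(. | y)
    proposal_logprob(a, b) -> log q(a | b)
    """
    y, T = y0, T0
    for _ in range(n_steps):
        y_new = propose(y)
        # log acceptance ratio: annealed target term plus proposal correction
        log_alpha = (score(x, y_new) - score(x, y)) / T \
                    + proposal_logprob(y, y_new) - proposal_logprob(y_new, y)
        if random.random() < math.exp(min(0.0, log_alpha)):
            y = y_new          # accept the proposed move
        T *= c                 # cool the temperature
    return y
```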
Figure 2: The random walk proposal used within Metropolis-Hastings.

Inputs: θ, x, y, K (number of heads to change). Outputs: y′
    Copy the head array of y into h
    Set toUpdate to true for K random nodes; the heads of all other nodes stay fixed
    for each node j with toUpdate(j) = true
        u ← j
        while u has not reached the current tree
            h(u) ← randomHead(u)    (sampled according to the first-order weights)
            if LoopExist(h) then EraseLoop(h)
            u ← h(u)
        add the walk path to the tree
    return Construct tree y′ from the head array h.
Perhaps the most natural choice of the proposal distribution is a conditional distribution derived from $p_\theta$. This is feasible if we restrict the proposed moves to only small changes in the current tree. In our case, we choose a word randomly, and then sample its head according to $p_\theta$ with the constraint that we obtain a valid tree (when projective trees are sought, this constraint is also incorporated). For this choice of $q$, the probability of accepting the new tree ($\alpha$ in Figure 1) is identically one. Thus new moves are always accepted.
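A minimal Python sketch of this Gibbs move is given below. The helpers valid_heads (heads that keep the structure a valid tree) and score (computing $s_\theta$) are hypothetical placeholders rather than the paper's code, and the temperature is omitted for brevity.

```python
import math
import random

def gibbs_step(x, y, score, valid_heads):
    """Resample the head of one randomly chosen word from its conditional under p_theta."""
    j = random.randrange(1, len(x))        # word indices start at 1; 0 is the artificial root
    candidates = valid_heads(y, j)         # heads that keep y a valid (optionally projective) tree
    scores = []
    for h in candidates:
        y_new = dict(y)
        y_new[j] = h
        scores.append(score(x, y_new))
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # unnormalized conditional probabilities
    y[j] = random.choices(candidates, weights=weights)[0]
    return y
```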
One shortcoming of the Gibbs sampler is that it only changes one variable (arc) at a time. This usually leads to slow mixing, requiring more samples to get close to the parse with maximum score. Ideally, we would change multiple heads in the parse tree simultaneously, and sample those choices from the corresponding conditional distribution of $p_\theta$. While in general this is increasingly difficult with more heads, it is indeed tractable if the model corresponds to a first-order parser. One such sampling algorithm is the random walk sampler of Wilson (1996). It can be used to obtain i.i.d. samples from distributions of the form:
$$p(y \mid x) \;\propto\; \prod_{(h, m) \in y} w_{hm} \qquad (3)$$

where $y$ corresponds to a tree with a specified root and $w_{hm}$ is the exponential of the first-order score of the arc from head $h$ to modifier $m$. Here $y$ is always a valid parse tree if we allow multiple children of the root and do not impose the projectivity constraint. The algorithm in Wilson (1996) iterates over all the nodes, and for each node performs a random walk according to the weights until the walk creates a loop or hits the tree. In the first case the algorithm erases the loop and continues the walk. If the walk hits the current tree, the walk path is added to form a new tree with more nodes. This is repeated until all the nodes are included in the tree. It can be shown that this procedure generates i.i.d. trees from Equation 3.
Since our features do not by design correspond to a first-order parser, we cannot use the Wilson algorithm as it is. Instead, we use it as the proposal function and sample a subset of the dependencies from the first-order distribution of our model, while fixing the others. In each step we uniformly sample $K$ nodes to update and sample their new heads using the Wilson algorithm (in the experiments we use a small fixed $K$). Note that blocked Gibbs sampling would be exponential in $K$, and is thus very slow already for small $K$. The procedure is described in Figure 2 with a graphic illustration in Figure 3.
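The following Python sketch illustrates the loop-erased random walk underlying this proposal. It is a simplified illustration: arc_weight(h, m) stands for the exponentiated first-order arc score, and the data layout and helpers are our own assumptions rather than the paper's code.

```python
import random

def wilson_proposal(update_nodes, fixed_heads, all_nodes, arc_weight, root=0):
    """Resample the heads of `update_nodes` from the first-order distribution (Eq. 3)
    via Wilson's loop-erased random walk, keeping the heads in `fixed_heads` unchanged."""
    heads = dict(fixed_heads)
    in_tree = set(heads) | {root}
    for start in update_nodes:
        nxt, u = {}, start
        while u not in in_tree:
            cands = [h for h in all_nodes if h != u]
            ws = [arc_weight(h, u) for h in cands]
            nxt[u] = random.choices(cands, weights=ws)[0]  # revisiting u overwrites nxt[u],
            u = nxt[u]                                     # which erases any loop in the walk
        v = start
        while v not in in_tree:                            # add the loop-erased path to the tree
            heads[v] = nxt[v]
            in_tree.add(v)
            v = nxt[v]
    return heads
```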
In this section, we describe how to learn the adjustable parameters in the scoring function. The parameters are learned in an on-line fashion by successively imposing soft constraints between pairs of dependency structures. We introduce both margin constraints and constraints pertaining to successive samples generated along the search path. We demonstrate later that both types of constraints are essential.
We begin with the standard margin constraints. An ideal scoring function would always rank the gold parse higher than any alternative. Moreover, alternatives that are far from the gold parse should score even lower. As a result, we require that
$$s_\theta(x^{(i)}, \hat{y}^{(i)}) - s_\theta(x^{(i)}, y) \;\geq\; \Delta(y, \hat{y}^{(i)}) \quad \forall\, y \in \mathcal{Y}(x^{(i)}) \qquad (4)$$
where $\Delta(y, \hat{y}^{(i)})$ is the number of head mistakes in $y$ relative to the gold parse $\hat{y}^{(i)}$. We adopt here the shorthand $s_\theta(y) = s_\theta(x^{(i)}, y)$, where the dependence on $x^{(i)}$ is implied from context. Note that Equation 4 contains exponentially many constraints and cannot be enforced jointly for general scoring functions. However, our sampling procedure generates a small number of structures along the search path. We enforce only the constraints corresponding to those samples.
The second type of constraints are enforced between successive samples along the search path. To illustrate the idea, consider a parse $y'$ that differs from $\hat{y}^{(i)}$ in only one arc, and a parse $y''$ that differs from $\hat{y}^{(i)}$ in ten arcs. We cannot necessarily assume that $s_\theta(y')$ is greater than $s_\theta(y'')$ without additional encouragement. Thus, we can complement the constraints in Equation 4 with additional pairwise constraints [32]:
$$s_\theta(y') - s_\theta(y'') \;\geq\; \Delta(y'', \hat{y}^{(i)}) - \Delta(y', \hat{y}^{(i)}) \qquad (5)$$
where, similarly to Equation 4, the difference in scores scales with the difference in errors with respect to the target $\hat{y}^{(i)}$. We only enforce the above constraints for pairs $(y', y'')$ that are consecutive samples in the course of the sampling process. These constraints serve to guide the sampling process derived from the scoring function towards the gold parse.
We learn the parameters $\theta$ in an on-line fashion to satisfy the above constraints. This is done via the MIRA algorithm [7]. Specifically, if the current parameters are $\theta_t$, and we enforce the constraint in Equation 5 for a particular pair $(y', y'')$, then we find the $\theta_{t+1}$ that minimizes
$$\frac{1}{2}\,\|\theta - \theta_t\|^2 + C\,\xi \quad \text{s.t.} \quad s_\theta(y') - s_\theta(y'') \;\geq\; \Delta(y'', \hat{y}^{(i)}) - \Delta(y', \hat{y}^{(i)}) - \xi, \;\; \xi \geq 0 \qquad (6)$$
The updates can be calculated in closed form. Figure 4 summarizes the learning algorithm. We repeatedly generate parses for each sentence $x^{(i)}$ based on the current parameters, and use successive samples to enforce the constraints in Equation 4 and Equation 5 one at a time.
Figure 4: The learning algorithm.

Inputs: training set {(x⁽ⁱ⁾, ŷ⁽ⁱ⁾)}, i = 1, …, N. Outputs: learned parameters θ.
    for epoch = 1 to #epochs
        for each training sentence x⁽ⁱ⁾ and each sampling step on its search path
            sample y′ ∼ q(· | y, x⁽ⁱ⁾, θ), where y is the current tree
            y⁺ ← acceptOrReject(y, y′, θ)
            if Δ(y, ŷ⁽ⁱ⁾) ≠ Δ(y⁺, ŷ⁽ⁱ⁾) and the corresponding constraint in Equation 5 is violated then θ ← updateMIRA(θ, y, y⁺)
            if the constraint in Equation 4 is violated for y⁺ then θ ← updateMIRA(θ, ŷ⁽ⁱ⁾, y⁺)
            y ← y⁺
        end for
    end for
    return Average of parameters.
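A sketch of the closed-form update for a single constraint of the form in Equations 5 and 6 is shown below. It is a standard passive-aggressive (MIRA) step written under our own assumptions, with feature vectors represented as plain Python dicts; it is not taken from the released code.

```python
def mira_update(theta, feat_better, feat_worse, loss_diff, C=1.0):
    """Enforce s(better) - s(worse) >= loss_diff with the smallest change to theta.

    feat_better / feat_worse: feature dicts f(x, y) of the lower- and higher-error trees.
    loss_diff: Delta(worse, gold) - Delta(better, gold), the required margin.
    """
    # feature difference g = f(better) - f(worse)
    g = dict(feat_better)
    for k, v in feat_worse.items():
        g[k] = g.get(k, 0.0) - v
    margin = sum(theta.get(k, 0.0) * v for k, v in g.items())
    violation = loss_diff - margin
    sq_norm = sum(v * v for v in g.values())
    if violation <= 0 or sq_norm == 0:
        return theta                      # constraint already satisfied (or no features differ)
    tau = min(C, violation / sq_norm)     # PA-I step size, capped by the regularization constant
    for k, v in g.items():
        theta[k] = theta.get(k, 0.0) + tau * v
    return theta
```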
It is easy to extend our sampling-based parsing framework to joint prediction of parsing and other labels. Specifically, when sampling the new heads, we can also sample the values of other variables at the same time. For instance, we can sample the POS tag, the dependency relation or morphology information. In this work, we investigate a joint POS correction scenario in which only the predicted POS tags are provided in the testing phase, while both gold and predicted tags are available for the training set.
We extend our model such that it jointly learns to predict a parse tree and to correct the predicted POS tags for better parsing performance. We generate the POS candidate list for each word based on the confusion matrix on the training set. Let $c(t_g, t_p)$ be the count of words whose gold tag is $t_g$ and whose predicted tag is $t_p$. For each word $w$, we first prune its POS candidates using the vocabulary from the training set; we do not prune anything if $w$ is unseen. Assuming that the predicted tag for $w$ is $t_p$, we further remove those candidate tags $t_g$ whose counts $c(t_g, t_p)$ are smaller than some threshold.²

²In our work we choose a threshold which gives a 98.9% oracle POS tagging accuracy on the CATiB development set.
After generating the candidate lists for each word, the rest of the extension is rather straightforward. At each sampling step, let $H_j$ be the set of candidate heads and $P_j$ be the set of candidate POS tags for the chosen word $x_j$. The Gibbs sampler then generates a new sample from the product space $H_j \times P_j$. The other parts of the algorithm remain the same.
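A small Python sketch of this joint move is shown below. The candidate sets and the score function are hypothetical placeholders standing in for the pruned head list, the confusion-matrix tag list, and $s_\theta$; it illustrates the product-space Gibbs step rather than the actual implementation.

```python
import math
import random

def joint_gibbs_step(x, y, tags, j, head_cands, tag_cands, score):
    """Jointly resample the head and the POS tag of word j from the
    product space head_cands[j] x tag_cands[j], proportionally to exp(score)."""
    options, scores = [], []
    for h in head_cands[j]:
        for t in tag_cands[j]:
            y_new, tags_new = dict(y), dict(tags)
            y_new[j], tags_new[j] = h, t
            options.append((h, t))
            scores.append(score(x, y_new, tags_new))
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    h, t = random.choices(options, weights=weights)[0]
    y[j], tags[j] = h, t
    return y, tags
```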
First- to Third-Order Features The feature templates of first- to third-order features are mainly drawn from previous work on graph-based parsing [23], transition-based parsing [25] and dual decomposition-based parsing [17]. As shown in Figure 5, the arc is the basic structure for first-order features. We also define features based on consecutive sibling, grandparent, arbitrary sibling, head bigram, grand-sibling and tri-siblings, which are also used in the Turbo parser [16]. In addition to these first- to third-order structures, we also consider grand-grandparent and sibling-grandchild structures. There are two types of sibling-grandchild structures: (1) inner-sibling when the sibling is between the head and the modifier and (2) outer-sibling for the other cases.
Global Features We use features shown to be promising in prior reranking work (Charniak and Johnson, 2005; Collins, 2000; Huang, 2008).
Right Branch This feature enables the model to prefer right or left-branching trees. It counts the number of words on the path from the root node to the right-most non-punctuation word, normalized by the length of the sentence.
Coordination In a coordinate structure, the two adjacent conjuncts usually agree with each other on POS tags and their span lengths. For instance, in cats and dogs, the conjuncts are both short noun phrases. Therefore, we add different features to capture POS tag and span length consistency in a coordinate structure.
PP Attachment We add features of lexical tuples involving the head, the argument and the preposition of prepositional phrases. Generally, this feature can be defined based on an instance of grandparent structure. However, we also handle the case of coordination. In this case, the arguments should be the conjuncts rather than the coordinator. Figure 6 shows an example.
Span Length This feature captures the distribution of the binned span length of each POS tag. It also includes flags indicating whether the span reaches the end of the sentence and whether the span is followed by punctuation.
Neighbors The POS tags of the neighboring words to the left and right of each span, together with the binned span length and the POS tag at the span root.
Valency We consider valency features for each POS tag. Specifically, we add two types of valency information: (1) the binned number of non-punctuation modifiers and (2) the concatenated POS string of all those modifiers.
Non-projective Arcs A flag indicating if a dependency is projective or not (i.e. if it spans a word that does not descend from its head) [17]. This flag is also combined with the POS tags or the lexical words of the head and the modifier.
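For concreteness, a possible way to compute this flag is sketched below (our own illustration, with heads stored as a modifier-to-head map and 0 denoting the root; this is not the paper's code).

```python
def is_nonprojective_arc(heads, h, m):
    """Return True if the arc h -> m is non-projective, i.e., it spans a word
    that does not descend from the head h."""
    lo, hi = (h, m) if h < m else (m, h)
    for k in range(lo + 1, hi):
        a = k
        while a != 0 and a != h:   # climb the head chain toward the root
            a = heads[a]
        if a != h:                 # k is not a descendant of h
            return True
    return False
```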
POS Tag Features In the joint POS correction scenario, we also add additional features specifically for POS prediction. The feature templates are inspired by previous feature-rich POS tagging work [31]. However, we are free to add higher order features because we do not rely on dynamic programming decoding. In our work we use feature templates up to 5-gram. Table 1 summarizes all POS tag feature templates.
Table 1: POS tag feature templates, grouped by n-gram order (1-gram, 2-gram, 3-gram, 4-gram, 5-gram).
Datasets We evaluate our model on standard benchmark corpora — CoNLL 2006 and CoNLL 2008 [4, 30] — which include dependency treebanks for 14 different languages. Most of these data sets contain non-projective dependency trees. We use all sentences in CoNLL datasets during training and testing. We also use the Columbia Arabic Treebank (CATiB) [20]. CATiB mostly includes projective trees. The trees are annotated with both gold and predicted versions of POS tags and morphology information. Following Marton et al. (2013), for this dataset we use 12 core POS tags, word lemmas, determiner features, rationality features and functional genders and numbers.
Some CATiB sentences exceed 200 tokens. For efficiency, we limit the sentence length to 70 tokens in the training and development sets. However, we do not impose this constraint during testing. We handle long sentences during testing by applying a simple split-merge strategy: we split the sentence at ending punctuation, predict the parse tree for each segment, and group the roots of the resulting trees under a single node.
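The split-merge step can be sketched as follows; the parse callback, the 1-based head-map representation, and the set of splitting punctuation marks are illustrative assumptions rather than details taken from the paper.

```python
def parse_long_sentence(tokens, parse, max_len=70, split_marks=(".", "!", "?")):
    """Split an over-long sentence at ending punctuation, parse each segment
    independently, and attach every segment root to a single root node (index 0)."""
    if len(tokens) <= max_len:
        return parse(tokens)
    segments, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in split_marks or i == len(tokens) - 1:
            segments.append((start, tokens[start:i + 1]))
            start = i + 1
    heads = {}
    for offset, seg in segments:
        for m, h in parse(seg).items():           # local heads: 1-based, 0 = segment root
            heads[offset + m] = 0 if h == 0 else offset + h
    return heads
```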
Evaluation Measures Following standard practice, we use Unlabeled Attachment Score (UAS) as the evaluation metric in all our experiments. We report UAS excluding punctuation on CoNLL datasets, following Martins et al. (2013). For the CATiB dataset, we report UAS including punctuation in order to be consistent with the published results in the 2013 SPMRL shared task [28].
Baselines We compare our model with the Turbo parser and the MST parser. For the Turbo parser, we directly compare with the recently published results in [16]. For the MST parser, we train a second-order non-projective model using the most recent version of the code.³

³http://sourceforge.net/projects/mstparser/
We also compare our model against a discriminative reranker. The reranker operates over the top-50 list obtained from the MST parser.⁴ We use 10-fold cross-validation to generate candidate lists for training. We then train the reranker by running 10 epochs of cost-augmented MIRA. The reranker uses the same features as our model, along with the tree scores obtained from the MST parser (which is standard practice in reranking).

⁴The MST parser is trained in projective mode for reranking because generating a top-k list from the second-order non-projective model is intractable.
Table 2: Results of our model and the baselines on the CoNLL datasets (UAS, excluding punctuation).

 | Our Model (Turbo Feat.) | Our Model (Full Feat.) | Turbo (UAS) | MST 2nd-Ord. (UAS) | Best Published UAS | Top-50 Reranker | Top-500 Reranker
---|---|---|---|---|---|---|---
Arabic | 79.86 | 80.21 | 79.64 | 78.75 | 81.12 (Ma11) | 79.03 | 78.91 |
Bulgarian | 92.97 | 93.30 | 93.10 | 91.56 | 94.02 (Zh13) | 92.81 | - |
Chinese | 92.06 | 92.63 | 89.98 | 91.77 | 91.89 (Ma10) | 92.25 | - |
Czech | 90.62 | 91.04 | 90.32 | 87.30 | 90.32 (Ma13) | 88.14 | - |
Danish | 91.45 | 91.80 | 91.48 | 90.50 | 92.00 (Zh13) | 90.88 | 90.91 |
Dutch | 85.83 | 86.47 | 86.19 | 84.11 | 86.19 (Ma13) | 81.01 | - |
English | 92.79 | 92.94 | 93.22 | 91.54 | 93.22 (Ma13) | 92.41 | - |
German | 91.79 | 92.07 | 92.41 | 90.14 | 92.41 (Ma13) | 91.19 | - |
Japanese | 93.23 | 93.42 | 93.52 | 92.92 | 93.72 (Ma11) | 93.40 | - |
Portuguese | 91.82 | 92.41 | 92.69 | 91.08 | 93.03 (Ko10) | 91.47 | - |
Slovene | 86.19 | 86.82 | 86.01 | 83.25 | 86.95 (Ma11) | 84.81 | 85.37 |
Spanish | 88.24 | 88.21 | 85.59 | 84.33 | 87.96 (Zh13) | 86.85 | 87.21 |
Swedish | 90.48 | 90.71 | 91.14 | 89.05 | 91.62 (Zh13) | 90.53 | - |
Turkish | 76.82 | 77.21 | 76.90 | 74.39 | 77.55 (Ko10) | 76.35 | 76.23 |
Average | 88.87 | 89.23 | 88.72 | 86.86 | 89.33 | 87.92 | - |
Experimental Details Following Koo and Collins (2010), we always first train a first-order pruner. For each word, we prune away the incoming dependencies with probability less than 0.005 times the probability of the most likely head, and limit the number of candidate heads to at most 30. This gives a 99% pruning recall on the CATiB development set. The first-order model is also trained using the algorithm in Figure 4. After pruning, we tune the regularization parameter on the development sets for different languages. Because the CoNLL datasets do not have a standard development set, we randomly select a held-out set of 200 sentences from the training set. We also pick the number of training epochs that gives the best performance on the development set for each language. After tuning, the model is trained on the full training set with the selected parameters.
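The pruning rule can be written down directly; the sketch below assumes the first-order pruner exposes per-word head marginals as dictionaries, which is our own simplification.

```python
def prune_heads(marginals, threshold=0.005, max_heads=30):
    """Keep, for each word, only heads whose first-order probability is at least
    `threshold` times the probability of the most likely head, capped at `max_heads`.

    marginals[j] is a dict {head_index: p(head | word j)} from the first-order pruner.
    """
    pruned = {}
    for j, probs in marginals.items():
        best = max(probs.values())
        kept = sorted(
            (h for h, p in probs.items() if p >= threshold * best),
            key=lambda h: probs[h],
            reverse=True,
        )
        pruned[j] = kept[:max_heads]
    return pruned
```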
We apply the Random Walk-based sampling method (see Section 3.2.2) for the standard dependency parsing task. However, for the joint parsing and POS correction on the CATiB dataset we do not use the Random Walk method because the first-order features in normal parsing are no longer first-order when POS tags are also variables. Therefore, the first-order distribution is not well-defined and we only employ Gibbs sampling for simplicity. On the CATiB dataset, we restrict the sample trees to always be projective as described in Section 3.2.1. However, we do not impose this constraint for the CoNLL datasets.
Comparison with State-of-the-art Parsers Table 2 summarizes the performance of our model and of the baselines. We first compare our model to the Turbo parser using the Turbo parser feature set. This is meant to test how our learning and inference methods compare to a dual decomposition approach. The first column in Table 2 shows the result for our model with an average of 88.87%, and the third column shows the results for the Turbo parser with an average of 88.72%. This suggests that our learning and inference procedures are as effective as the dual decomposition method in the Turbo parser.
Next, we add global features that are not used by the Turbo parser. The performance of our model is shown in the second column with an average of 89.23%. It outperforms the Turbo parser by 0.5% and achieves the best reported performance on four languages. Moreover, our model also outperforms the 88.80% average UAS reported in Martins et al. (2011), which is the top performing single parsing system (to the best of our knowledge).
Comparison with Reranking As column 6 of Table 2 shows, our model outperforms the reranker by 1.3%.⁵ One possible explanation of this performance gap between the reranker and our model is the small number of candidates considered by the reranker. To test this hypothesis, we performed experiments with a top-500 list for a subset of languages.⁶ As column 7 shows, this increase in the list size does not change the relative performance of the reranker and our model.

⁵Note that the comparison is conservative because we can also add MST scores as features in our model, as in the reranker. With these features our model achieves an average UAS of 89.28%.

⁶We ran this experiment on 5 languages with small datasets due to the scalability issues associated with reranking a top-500 list.
Joint Parsing and POS Correction Table 3 shows the results of joint parsing and POS correction on the CATiB dataset, for our model and state-of-the-art systems. As the upper part of the table shows, the parser with corrected tags reaches 88.38% compared to the accuracy of 88.46% on the gold tags. This is a substantial increase from the parser that uses predicted tags (86.95%).
To put these numbers into perspective, the bottom part of Table 3 shows the accuracy of the best systems from the 2013 SPMRL shared task on Arabic parsing using predicted information [28]. Our system not only outperforms the best single system [2] by 1.4%, but it also tops the ensemble system that combines three powerful parsers: the Mate parser [3], the Easy-First parser [10] and the Turbo parser [16].
Impact of Sampling Methods We compare the two sampling methods introduced in Section 3.2 with respect to their decoding efficiency. Specifically, we measure the score of the retrieved trees in testing as a function of the decoding speed, measured by the number of tokens per second. We change the temperature update rate in order to decode at different speeds. In Figure 9 we show the corresponding curves for two languages: Arabic and Chinese. We select these two languages as they correspond to two extremes in sentence length: Arabic has the longest sentences on average, while Chinese has the shortest ones. For both languages, the tree score improves over time. Given sufficient time, both sampling methods achieve the same score. However, the Random Walk-based sampler performs better when quality is traded for speed. This result is to be expected given that each iteration of this sampler makes multiple changes to the tree, in contrast to the single-edge change of the Gibbs sampler.
Table 3: Joint parsing and POS correction on the CATiB dataset, compared with systems from the 2013 SPMRL shared task.

 | Dev. Set POS Acc. | Dev. Set UAS | Test Set POS Acc. | Test Set UAS
---|---|---|---|---
Gold | - | 90.27 | - | 88.46
Predicted | 96.87 | 88.81 | 96.82 | 86.95
POS Correction | 97.72 | 90.08 | 97.49 | 88.38
CADIM | 96.87 | 87.4 | 96.82 | 85.78
IMS-Single | - | - | - | 86.96
IMS-Ensemble | - | - | - | 88.32
The Effect of Constraints in Learning Our training method updates parameters to satisfy the pairwise constraints between (1) subsequent samples on the sampling path and (2) selected samples and the ground truth. Figure 10 shows that applying both types of constraints is consistently better than using either of them alone. Moreover, these results demonstrate that comparison between subsequent samples is more important than comparison against the gold tree.
Decoding Speed Our sampling-based parser is an anytime algorithm, and therefore its running time can be traded for performance. Figure 9 illustrates this trade-off. In the experiments reported above, we chose a conservative cooling rate and continued to sample until the score no longer changed. The parser still processed all the datasets in a reasonable time. For example, decoding all the test sentences took 3 minutes for Chinese and 15 minutes for Arabic. Our current implementation is in Java and can be further optimized for speed.
This paper demonstrates the power of combining a simple inference procedure with a highly expressive scoring function. Our model achieves the best results on the standard dependency parsing benchmark, outperforming parsing methods with elaborate inference procedures. In addition, this framework provides simple and effective means for joint parsing and corrective tagging.
This research is developed in collaboration with the Arabic Language Technologies (ALT) group at the Qatar Computing Research Institute (QCRI) within the Iyas project. The authors acknowledge the support of the MURI program (W911NF-10-1-0533), the DARPA BOLT program, and the US-Israel Binational Science Foundation (BSF, Grant No. 2012330). We thank the MIT NLP group and the ACL reviewers for their comments.