Steps to Excellence: Simple Inference with Refined Scoring of Dependency Trees

Yuan Zhang, Tao Lei, Regina Barzilay, Tommi Jaakkola
Massachusetts Institute of Technology
{yuanzh, taolei, regina, tommi}@csail.mit.edu

Amir Globerson
The Hebrew University
gamir@cs.huji.ac.il
Abstract

Much of the recent work on dependency parsing has been focused on solving inherent combinatorial problems associated with rich scoring functions. In contrast, we demonstrate that highly expressive scoring functions can be used with substantially simpler inference procedures. Specifically, we introduce a sampling-based parser that can easily handle arbitrary global features. Inspired by SampleRank, we learn to take guided stochastic steps towards a high scoring parse. We introduce two samplers for traversing the space of trees, Gibbs and Metropolis-Hastings with Random Walk. The model outperforms state-of-the-art parsers when evaluated on 14 languages of non-projective CoNLL datasets. Our sampling-based approach naturally extends to joint prediction scenarios, such as joint parsing and POS correction. The resulting method outperforms the best reported results on the CATiB dataset, approaching the performance of parsing with gold tags. (The source code for the work is available at http://groups.csail.mit.edu/rbg/code/global/acl2014.)

1 Introduction

Dependency parsing is commonly cast as a maximization problem over a parameterized scoring function. In this view, the use of more expressive scoring functions leads to more challenging combinatorial problems of finding the maximizing parse. Much of the recent work on parsing has therefore focused on improved methods for solving this combinatorial maximization problem. Indeed, state-of-the-art results have been obtained by adapting powerful tools from optimization [16, 17, 27]. We depart from this view and instead focus on using highly expressive scoring functions with substantially simpler inference procedures. The key ingredient in our approach is how learning is coupled with inference. Our combination outperforms state-of-the-art parsers, and remains comparable to them even when we adopt their scoring functions.

Rich scoring functions have been used for some time. They first appeared in the context of reranking [6], where a simple parser is used to generate a candidate list which is then reranked according to the scoring function. Because the number of alternatives is small, the scoring function could in principle involve arbitrary (global) features of parse trees. The power of this methodology is nevertheless limited by the initial set of alternatives from the simpler parser. Indeed, the set may already omit the gold parse. We dispense with the notion of a candidate set and seek to exploit the scoring function more directly.

In this paper, we introduce a sampling-based parser that places few or no constraints on the scoring function. Starting with an initial candidate tree, our inference procedure climbs the scoring function in small (cheap) stochastic steps towards a high scoring parse. The proposal distribution over the moves is derived from the scoring function itself. Because the steps are small, the complexity of the scoring function has limited impact on the computational cost of the procedure. We explore two alternative proposal distributions. Our first strategy is akin to Gibbs sampling and samples a new head for each word in the sentence, modifying one arc at a time. The second strategy relies on a provably correct sampler for first-order scores [33], and uses it within a Metropolis-Hastings algorithm for general scoring functions. It turns out that the latter optimizes the score more efficiently than the former.

Because the inference procedure is so simple, it is important that the parameters of the scoring function are chosen in a manner that facilitates climbing the scoring function in small steps. One way to achieve this is to make sure that improvements in the scoring function are correlated with improvements in the quality of the parse. This approach was suggested in the SampleRank framework [32] for training structured prediction models. The method was originally developed for a sequence labeling task with local features, and was shown to be more effective than state-of-the-art alternatives. Here we adapt SampleRank to parsing, with several modifications such as the proposal distributions mentioned earlier.

The benefits of sampling-based learning go beyond stand-alone parsing. For instance, we can use the framework to correct preprocessing mistakes in features such as part-of-speech (POS) tags. In this case, we combine the scoring function for trees with a stand-alone tagging model. When proposing a small move, i.e., sampling the head of a word, we can also jointly sample its POS tag from a set of alternatives provided by the tagger. As a result, the selected tag is influenced by a broad syntactic context above and beyond the initial tagging model and is directly optimized to improve parsing performance. Our joint parsing-tagging model provides an alternative to the widely adopted pipeline setup.

We evaluate our method on benchmark multilingual dependency corpora. Our method outperforms the Turbo parser across 14 languages on average by 0.5%. On four languages, we top the best published results. Our method provides a more effective mechanism for handling global features than reranking, outperforming it by 1.3%. In terms of joint parsing and tagging on the CATiB dataset, we nearly bridge (88.38%) the gap between independently predicted (86.95%) and gold tags (88.45%). This is better than the best published results in the 2013 SPMRL shared task [28], including parser ensembles.

2 Related Work

Earlier work on dependency parsing focused on inference with tractable scoring functions. For instance, a scoring function that operates over each single dependency can be optimized using the maximum spanning tree algorithm [21]. It was soon realized that using higher-order features could be beneficial, even at the cost of using approximate inference and sacrificing optimality. The first successful approach in this arena was reranking [6, 5], applied to constituency parsing. Reranking can be combined with an arbitrary scoring function, and thus can easily incorporate global features over the entire parse tree. Its main disadvantage is that the output parse can only be one of the few parses passed to the reranker.

Recent work has focused on more powerful inference mechanisms that consider the full search space [34, 27, 14, 12]. For instance, Nakagawa (2007) deals with tractability issues by using sampling to approximate marginals. Another example is the dual decomposition (DD) framework [14, 17]. The idea in DD is to decompose the hard maximization problem into smaller parts that can be efficiently maximized and enforce agreement among these via Lagrange multipliers. The method is essentially equivalent to linear programming relaxation approaches [19, 29], and also similar in spirit to ILP approaches [26].

A natural approach to approximate global inference is via search. For instance, a transition-based parsing system [36] incrementally constructs a parse using greedy beam search. Other approaches operate over full trees and generate a sequence of candidates that successively increase the score [8, 15, 32]. Our work builds on one such approach, SampleRank [32], a sampling-based learning algorithm. In SampleRank, the parameters are adjusted so as to guide the sequence of candidates closer to the target structure along the search path. The method has been successfully used in sequence labeling and machine translation [11]. In this paper, we demonstrate how to adapt the method for parsing with rich scoring functions.

3 Sampling-Based Dependency Parsing with Global Features

In this section, we introduce our novel sampling-based dependency parser which can incorporate arbitrary global features. We begin with the notation before addressing the decoding and learning algorithms. Finally, we extend our model to a joint parsing and POS correction task.

3.1 Notations

We denote sentences by x and the corresponding dependency trees by y ∈ 𝒴(x). Here 𝒴(x) is the set of valid (projective or non-projective) dependency trees for sentence x. We use x_j to refer to the j-th word of sentence x, and h_j to the head word of x_j. A training set of size N is given as a set of pairs 𝒟 = {(x^(i), y^(i))}_{i=1}^N, where y^(i) is the ground-truth parse for sentence x^(i).

We parameterize the scoring function s(x,y) as

s(x,y) = θ·f(x,y)   (1)

where f(x,y) is the feature vector associated with tree y for sentence x. We do not make any assumptions about how the feature function decomposes. In contrast, most state-of-the-art parsers operate under the assumption that the feature function decomposes into a sum of simpler terms. For example, in the second-order MST parser [23], all the feature terms involve arcs or consecutive siblings. Similarly, parsers based on dual decomposition [17, 14] assume that s(x,y) decomposes into a sum of terms where each term can be maximized over y efficiently.

3.2 Decoding

The decoding problem consists of finding a valid dependency tree y ∈ 𝒴(x) that maximizes the score s(x,y) = θ·f(x,y) with parameters θ. For scoring functions that extend beyond first-order arc preferences, finding the maximizing non-projective tree is known to be NP-hard [23]. We find a high scoring tree through sampling, and (later) learn the parameters θ so as to further guide this process.

Our sampler generates a sequence of dependency structures so as to approximate independent samples from

p(y|x,T,θ) ∝ exp(s(x,y)/T)   (2)

The temperature parameter T controls how concentrated the samples are around the maximum of s(x,y) (e.g., see Geman and Geman (1984)). Sampling from the target distribution p is typically as hard as (or harder than) maximizing s(x,y). We therefore follow a Metropolis-Hastings sampling algorithm (e.g., see Andrieu et al. (2003)) and explore different alternative proposal distributions q(y′|x,y,θ,T). The distribution q governs the small steps that are taken in generating a sequence of structures. The target distribution p folds into the procedure by defining the probability with which we accept the proposed move. The general structure of our sampling algorithm is given in Figure 1.

Inputs: θ, x, T_0 (initial temperature), c (temperature update rate), proposal distribution q
Output: y*

T ← T_0
Set y_0 to some random tree
y* ← y_0
repeat
    y′ ∼ q(·|x, y_t, T, θ)
    if s(x, y′) > s(x, y*) then
        y* ← y′
    end if
    α ← min[1, p(y′)q(y_t|y′) / (p(y_t)q(y′|y_t))]
    Sample Bernoulli variable Z with P[Z = 1] = α
    if Z = 0 then y_{t+1} ← y_t else y_{t+1} ← y′ end if
    t ← t + 1
    T ← cT
until convergence
return y*

Figure 1: Sampling-based algorithm for decoding (i.e., approximately maximizing s(x,y)).
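To make the procedure concrete, the following is a minimal Python sketch of the annealed Metropolis-Hastings loop of Figure 1. It is not the paper's implementation (which is in Java): the callables score, propose, and proposal_prob are illustrative stand-ins for the model score s, the proposal sampler q, and its density, and the acceptance ratio is computed in log space for numerical stability.

```python
import math
import random

def mh_decode(x, score, propose, proposal_prob, init_tree,
              T0=1.0, c=0.99, n_iters=5000):
    """Annealed Metropolis-Hastings decoding in the style of Figure 1.

    score(x, y)          -- the model score s(x, y)
    propose(x, y, T)     -- draws a candidate tree from q(.|x, y, T, theta)
    proposal_prob(a, b)  -- the proposal density q(a|b)
    """
    T = T0
    y = init_tree
    best = init_tree
    for _ in range(n_iters):
        y_new = propose(x, y, T)
        if score(x, y_new) > score(x, best):
            best = y_new  # keep the highest-scoring tree seen so far
        # alpha = min(1, p(y')q(y|y') / (p(y)q(y'|y))) with p ~ exp(s/T),
        # computed in log space to avoid overflow
        log_alpha = min(0.0, (score(x, y_new) - score(x, y)) / T
                        + math.log(proposal_prob(y, y_new))
                        - math.log(proposal_prob(y_new, y)))
        if math.log(random.random() + 1e-300) < log_alpha:
            y = y_new  # accept the proposed move
        T *= c  # cool the temperature towards the mode of p
    return best
```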

Inputs: x, y_t, θ, K (number of heads to change)
Output: y′

for i = 1 to |x| do
    inTree[i] ← false
    ChangeNode[i] ← false
end for
Set ChangeNode to true for K random nodes
head[0] ← −1
for i = 1 to |x| do
    u ← i
    while not inTree[u] do
        if ChangeNode[u] then
            head[u] ← randomHead(u, θ)
        else
            head[u] ← y_t(u)
        end if
        u ← head[u]
        if LoopExist(head) then
            EraseLoop(head)
        end if
    end while
    u ← i
    while not inTree[u] do
        inTree[u] ← true
        u ← head[u]
    end while
end for
return the tree y′ constructed from the head array

Figure 2: A proposal distribution q(y′|y_t) based on the random walk sampler of Wilson (1996). The function randomHead samples a new head for node u according to the first-order weights given by θ.
Figure 3: An illustration of the random walk sampler. The index on each edge indicates its order on the walk path. The heads of the red words are sampled while the others are fixed. The blue edges represent the current walk path and the black ones are already in the tree. Note that the walk direction is opposite to the dependency direction. (a) shows the original tree before sampling; (b) and (c) show the walk path and how the tree is generated in two steps. The loop not → Monday → not in (b) is erased.

3.2.1 Gibbs Sampling

Perhaps the most natural choice of the proposal distribution q is a conditional distribution from p. This is feasible if we restrict the proposed moves to only small changes in the current tree. In our case, we choose a word j randomly, and then sample its head hj according to p with the constraint that we obtain a valid tree (when projective trees are sought, this constraint is also incorporated). For this choice of q, the probability of accepting the new tree (α in Figure 1) is identically one. Thus new moves are always accepted.
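A minimal sketch of one such move is shown below, assuming a head-array tree representation and illustrative helpers score (the model score s(x,y)) and creates_cycle (the tree-validity check); neither name comes from the paper. Because the new head is drawn from the conditional of p itself, the move is always accepted, matching the observation above.

```python
import math
import random

def gibbs_step(x, y, score, T, creates_cycle):
    """One Gibbs move: resample the head of a random word from the
    conditional of p(y) ~ exp(s(x, y)/T), keeping all other arcs fixed.

    y is a head array: y[j] is the head of word j, and index 0 is the root.
    creates_cycle(y, j, h) reports whether attaching j under h breaks treeness.
    """
    j = random.randrange(1, len(y))  # pick a word (never the root)
    candidates, weights = [], []
    for h in range(len(y)):
        if h == j or creates_cycle(y, j, h):
            continue  # only consider heads that keep y a valid tree
        y_try = list(y)
        y_try[j] = h
        candidates.append(h)
        weights.append(score(x, y_try) / T)
    m = max(weights)  # softmax over the candidate heads
    probs = [math.exp(w - m) for w in weights]
    y[j] = random.choices(candidates, weights=probs)[0]
    return y
```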

3.2.2 Exact First-Order Sampling

One shortcoming of the Gibbs sampler is that it only changes one variable (arc) at a time. This usually leads to slow mixing, requiring more samples to get close to the parse with maximum score. Ideally, we would change multiple heads in the parse tree simultaneously, and sample those choices from the corresponding conditional distribution of p. While in general this is increasingly difficult with more heads, it is indeed tractable if the model corresponds to a first-order parser. One such sampling algorithm is the random walk sampler of Wilson (1996). It can be used to obtain i.i.d. samples from distributions of the form:

p(y) ∝ ∏_{(i,j)∈y} w_ij   (3)

where y corresponds to a tree with a specified root and w_ij is the exponential of the first-order score. y is always a valid parse tree if we allow multiple children of the root and do not impose the projectivity constraint. The algorithm in Wilson (1996) iterates over all the nodes, and for each node performs a random walk according to the weights w_ij until the walk creates a loop or hits the current tree. In the first case the algorithm erases the loop and continues the walk. If the walk hits the current tree, the walk path is added to form a new tree with more nodes. This is repeated until all the nodes are included in the tree. It can be shown that this procedure generates i.i.d. trees from p(y).

Since our features do not by design correspond to a first-order parser, we cannot use the Wilson algorithm as is. Instead, we use it as the proposal function: we sample a subset of the dependencies from the first-order distribution of our model while fixing the others. In each step we uniformly sample K nodes to update and sample their new heads using the Wilson algorithm (in the experiments we use K=4). Note that blocked Gibbs sampling would be exponential in K, and is thus very slow already at K=4. The procedure is described in Figure 2, with a graphic illustration in Figure 3.
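As a sketch of the proposal in Figure 2, the Python below runs a loop-erased random walk over the words whose heads are being resampled, keeping all other arcs fixed. The helper random_head, the head-array representation, and the dict-based bookkeeping are illustrative assumptions, not the paper's implementation.

```python
def random_walk_proposal(n, y_cur, change, random_head):
    """Loop-erased random walk proposal in the style of Figure 2.

    n              -- number of words (nodes 1..n; node 0 is the root)
    y_cur[u]       -- current head of node u
    change         -- set of nodes whose heads are resampled
    random_head(u) -- samples a head for u from the first-order weights
    """
    head = {0: -1}
    in_tree = {0: True}
    for i in range(1, n + 1):
        path = []
        u = i
        # walk towards the tree, erasing loops as they appear
        while not in_tree.get(u, False):
            if u in path:
                path = path[:path.index(u)]  # erase the loop u -> ... -> u
            path.append(u)
            head[u] = random_head(u) if u in change else y_cur[u]
            u = head[u]
        for v in path:
            in_tree[v] = True  # the walk hit the tree: keep its path
    return head  # head array of the proposed tree y'
```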

3.3 Training

In this section, we describe how to learn the adjustable parameters θ in the scoring function. The parameters are learned in an on-line fashion by successively imposing soft constraints between pairs of dependency structures. We introduce both margin constraints and constraints pertaining to successive samples generated along the search path. We demonstrate later that both types of constraints are essential.

We begin with the standard margin constraints. An ideal scoring function would always rank the gold parse higher than any alternative. Moreover, alternatives that are far from the gold parse should score even lower. As a result, we require that

s(x^(i), y^(i)) − s(x^(i), y) ≥ Δ(y^(i), y)   ∀y   (4)

where Δ(y^(i), y) is the number of head mistakes in y relative to the gold parse y^(i). We adopt here a shorthand Err(y) = Δ(y^(i), y), where the dependence on y^(i) is implied from context. Note that Equation 4 contains exponentially many constraints and cannot be enforced jointly for general scoring functions. However, our sampling procedure generates a small number of structures along the search path. We enforce only constraints corresponding to those samples.

The second type of constraints is enforced between successive samples along the search path. To illustrate the idea, consider a parse y′ that differs from y^(i) in only one arc, and a parse y″ that differs from y^(i) in ten arcs. We cannot necessarily assume that s(x,y′) is greater than s(x,y″) without additional encouragement. Thus, we complement the constraints in Equation 4 with additional pairwise constraints [32]:

s(x^(i), y′) − s(x^(i), y″) ≥ Err(y″) − Err(y′)   (5)

where, similarly to Equation 4, the difference in scores scales with the difference in errors with respect to the target y^(i). We only enforce the above constraints for pairs y′, y″ that are consecutive samples in the course of the sampling process. These constraints serve to guide the sampling process derived from the scoring function towards the gold parse.

We learn the parameters θ in an on-line fashion to satisfy the above constraints. This is done via the MIRA algorithm [7]. Specifically, if the current parameters are θ_t, and we enforce the constraint in Equation 5 for a particular pair y′, y″, then we find θ_{t+1} by solving

min_θ ||θ − θ_t||² + Cξ   s.t.   θ·(f(x, y′) − f(x, y″)) ≥ Err(y″) − Err(y′) − ξ,  ξ ≥ 0   (6)

The updates can be calculated in closed form. Figure 4 summarizes the learning algorithm. We repeatedly generate parses based on the current parameters θ_t for each sentence x^(i), and use successive samples to enforce the constraints in Equations 4 and 5 one at a time.

Inputs: 𝒟 = {(x^(i), y^(i))}_{i=1}^N
Output: learned parameters θ

θ_0 ← 0
for e = 1 to #epochs do
    for i = 1 to N do
        y′ ∼ q(·|x^(i), y_i^{t_i}, θ_t)
        y+ ← argmin_{y ∈ {y_i^{t_i}, y′}} Err(y)
        y− ← argmax_{y ∈ {y_i^{t_i}, y′}} Err(y)
        y_i^{t_i+1} ← acceptOrReject(y′, y_i^{t_i}, θ_t)
        t_i ← t_i + 1
        Δf ← f(x^(i), y+) − f(x^(i), y−)
        ΔErr ← Err(y−) − Err(y+)
        if ΔErr ≠ 0 and θ_t·Δf < ΔErr then
            θ_{t+1} ← updateMIRA(Δf, ΔErr, θ_t)
            t ← t + 1
        end if
        Δf_g ← f(x^(i), y^(i)) − f(x^(i), y_i^{t_i})
        if θ_t·Δf_g < Err(y_i^{t_i}) then
            θ_{t+1} ← updateMIRA(Δf_g, Err(y_i^{t_i}), θ_t)
            t ← t + 1
        end if
    end for
end for
return the average of θ_0, …, θ_t

Figure 4: SampleRank algorithm for learning. The rejection strategy is as in Figure 1. y_i^{t_i} is the t_i-th tree sample for x^(i). The first MIRA update (see Equation 6) enforces a ranking constraint between two sampled parses. The second MIRA update enforces constraints between a sampled parse and the gold parse. In practice several samples are drawn for each sentence in each epoch.
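The closed-form step inside updateMIRA is the standard clipped MIRA (passive-aggressive) update for a single violated constraint. A minimal sketch follows, assuming sparse feature vectors represented as Python dicts; this representation is our choice for illustration, not the paper's data structure.

```python
def mira_update(theta, delta_f, delta_err, C=0.01):
    """Closed-form MIRA update for one violated constraint
    theta . delta_f >= delta_err (cf. Equation 6).

    theta     -- weight vector as a dict {feature: weight}
    delta_f   -- feature difference f(x, y') - f(x, y'') as a sparse dict
    delta_err -- required margin Err(y'') - Err(y')
    """
    score_diff = sum(theta.get(k, 0.0) * v for k, v in delta_f.items())
    loss = delta_err - score_diff
    sq_norm = sum(v * v for v in delta_f.values())
    if loss <= 0 or sq_norm == 0:
        return theta  # constraint already satisfied; no update needed
    tau = min(C, loss / sq_norm)  # step size, clipped by the regularizer C
    for k, v in delta_f.items():
        theta[k] = theta.get(k, 0.0) + tau * v
    return theta
```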

3.4 Joint Parsing and POS Correction

It is easy to extend our sampling-based parsing framework to joint prediction of parsing and other labels. Specifically, when sampling the new heads, we can also sample the values of other variables at the same time. For instance, we can sample the POS tag, the dependency relation or morphology information. In this work, we investigate a joint POS correction scenario in which only the predicted POS tags are provided in the testing phase, while both gold and predicted tags are available for the training set.

We extend our model such that it jointly learns to predict a parse tree and to correct the predicted POS tags for better parsing performance. We generate the POS candidate list for each word based on the confusion matrix of the tagger on the training set. Let c(t_g, t_p) be the number of times the gold tag is t_g while the predicted one is t_p. For each word w, we first prune its POS candidates using the vocabulary from the training set; we do not prune anything if w is unseen. Assuming that the predicted tag for w is t_p, we further remove each tag t whose count falls below a threshold, i.e., c(t, t_p) < α·c(t_p, t_p). (In our work we choose α = 0.003, which gives a 98.9% oracle POS tagging accuracy on the CATiB development set.)
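A minimal sketch of this candidate construction is given below, assuming the confusion matrix is stored as a dict of (gold, predicted) counts and lexicon maps each word to the tags it was seen with in training; both representations are our own assumptions.

```python
def pos_candidates(word, pred_tag, confusion, lexicon, alpha=0.003):
    """Candidate POS tags for one word (Section 3.4, sketch).

    confusion[(gold, pred)] -- confusion-matrix counts from the training set
    lexicon[word]           -- set of tags seen with `word` in training
    """
    seen = lexicon.get(word)  # None means the word is unseen: no vocab pruning
    self_count = confusion.get((pred_tag, pred_tag), 0)
    cands = {pred_tag}  # the predicted tag itself is always a candidate
    for (gold, pred), count in confusion.items():
        if pred != pred_tag:
            continue
        # keep gold only if it is confused with pred_tag often enough
        if count >= alpha * self_count and (seen is None or gold in seen):
            cands.add(gold)
    return cands
```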

After generating the candidate lists for each word, the rest of the extension is straightforward. For each sampled word, let ℋ be the set of candidate heads and 𝒯 be the set of candidate POS tags. The Gibbs sampler generates a new sample from the product space ℋ × 𝒯. The other parts of the algorithm remain the same.
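As an illustration, one joint Gibbs move over ℋ × 𝒯 might look as follows; the head-array and tag-list representations, the score interface, and the candidate sets heads_j and tags_j are all assumptions of the sketch rather than the paper's implementation.

```python
import math
import random

def joint_gibbs_step(x, y, tags, j, heads_j, tags_j, score, T):
    """One joint Gibbs move: resample (head, tag) of word j from H x T,
    scored by the joint parsing-and-tagging model score(x, y, tags)."""
    options, weights = [], []
    for h in heads_j:        # candidate heads H for word j
        for t in tags_j:     # candidate POS tags T for word j
            y_try, tags_try = list(y), list(tags)
            y_try[j], tags_try[j] = h, t
            options.append((h, t))
            weights.append(score(x, y_try, tags_try) / T)
    m = max(weights)  # softmax over the joint candidates
    probs = [math.exp(w - m) for w in weights]
    y[j], tags[j] = random.choices(options, weights=probs)[0]
    return y, tags
```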

Figure 5: First- to third-order features.

4 Features

First- to Third-Order Features The feature templates of first- to third-order features are mainly drawn from previous work on graph-based parsing [23], transition-based parsing [25] and dual decomposition-based parsing [17]. As shown in Figure 5, the arc is the basic structure for first-order features. We also define features based on consecutive sibling, grandparent, arbitrary sibling, head bigram, grand-sibling and tri-siblings, which are also used in the Turbo parser [16]. In addition to these first- to third-order structures, we also consider grand-grandparent and sibling-grandchild structures. There are two types of sibling-grandchild structures: (1) inner-sibling when the sibling is between the head and the modifier and (2) outer-sibling for the other cases.

Global Features We use features shown to be promising in prior reranking work (Charniak and Johnson, 2005; Collins, 2000; Huang, 2008).

  • Right Branch This feature enables the model to prefer right- or left-branching trees. It counts the number of words on the path from the root node to the right-most non-punctuation word, normalized by the length of the sentence (see the sketch after this list).

  • Coordination In a coordinate structure, the two adjacent conjuncts usually agree with each other on POS tags and their span lengths. For instance, in cats and dogs, the conjuncts are both short noun phrases. Therefore, we add different features to capture POS tag and span length consistency in a coordinate structure.

  • PP Attachment We add features of lexical tuples involving the head, the argument and the preposition of prepositional phrases. Generally, this feature can be defined based on an instance of grandparent structure. However, we also handle the case of coordination. In this case, the arguments should be the conjuncts rather than the coordinator. Figure 6 shows an example.

    Figure 6: An example of PP attachment with coordination. The arguments should be knife and fork, not and.
  • Span Length This feature captures the distribution of the binned span length of each POS tag. It also includes flags for whether the span reaches the end of the sentence and whether the span is followed by punctuation.

  • Neighbors The POS tags of the neighboring words to the left and right of each span, together with the binned span length and the POS tag at the span root.

  • Valency We consider valency features for each POS tag. Specifically, we add two types of valency information: (1) the binned number of non-punctuation modifiers and (2) the concatenated POS string of all those modifiers.

  • Non-projective Arcs A flag indicating whether a dependency arc is non-projective, i.e., whether it spans a word that does not descend from its head [17]. This flag is also combined with the POS tags or the lexical words of the head and the modifier.
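As an example of how such a global feature resists arc-level decomposition, here is a sketch of the Right Branch feature above, under an assumed head-array representation (heads[j] is the head of word j, with 0 as the root); the interface is illustrative.

```python
def right_branch_feature(heads, is_punct):
    """Right Branch feature: the number of words on the path from the root
    to the right-most non-punctuation word, normalized by sentence length.

    heads[j] is the head of word j (words 1..n); index 0 is the root.
    """
    n = len(heads) - 1
    rightmost = max((j for j in range(1, n + 1) if not is_punct[j]), default=0)
    depth, u = 0, rightmost
    while u != 0:  # climb from the right-most word up to the root
        depth += 1
        u = heads[u]
    return depth / float(n) if n else 0.0
```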

POS Tag Features In the joint POS correction scenario, we also add features specifically for POS prediction. The feature templates are inspired by previous feature-rich POS tagging work [31]. However, we are free to add higher-order features because we do not rely on dynamic-programming decoding. In our work we use feature templates of up to 5-grams. Table 1 summarizes all POS tag feature templates.

1-gram   ⟨t_i⟩, ⟨t_i, w_i−2⟩, ⟨t_i, w_i−1⟩, ⟨t_i, w_i⟩, ⟨t_i, w_i+1⟩, ⟨t_i, w_i+2⟩
2-gram   ⟨t_i−1, t_i⟩, ⟨t_i−2, t_i⟩, ⟨t_i−1, t_i, w_i−1⟩, ⟨t_i−1, t_i, w_i⟩
3-gram   ⟨t_i−1, t_i, t_i+1⟩, ⟨t_i−2, t_i, t_i+1⟩, ⟨t_i−1, t_i, t_i+2⟩, ⟨t_i−2, t_i, t_i+2⟩
4-gram   ⟨t_i−2, t_i−1, t_i, t_i+1⟩, ⟨t_i−2, t_i−1, t_i, t_i+2⟩, ⟨t_i−2, t_i, t_i+1, t_i+2⟩
5-gram   ⟨t_i−2, t_i−1, t_i, t_i+1, t_i+2⟩
Table 1: POS tag feature templates. t_i and w_i denote the POS tag and the word at the current position; t_i−x and t_i+x denote the left and right context tags, and similarly for words.

5 Experimental Setup

Datasets We evaluate our model on standard benchmark corpora, CoNLL 2006 and CoNLL 2008 [4, 30], which include dependency treebanks for 14 different languages. Most of these datasets contain non-projective dependency trees. We use all sentences in the CoNLL datasets during training and testing. We also use the Columbia Arabic Treebank (CATiB) [20]. CATiB mostly includes projective trees. The trees are annotated with both gold and predicted versions of POS tags and morphology information. Following Marton et al. (2013), for this dataset we use 12 core POS tags, word lemmas, determiner features, rationality features, and functional genders and numbers.

Some CATiB sentences exceed 200 tokens. For efficiency, we limit the sentence length to 70 tokens in the training and development sets. However, we do not impose this constraint during testing. Instead, we handle long sentences at test time with a simple split-merge strategy: we split the sentence on sentence-ending punctuation, predict the parse tree for each segment, and group the roots of the resulting trees under a single node.
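A sketch of this split-merge step is shown below, assuming parse returns a head array per segment (with 0 denoting the segment root) and is_end_punct flags sentence-ending punctuation; both interfaces are illustrative assumptions.

```python
def parse_long_sentence(tokens, parse, is_end_punct):
    """Split-merge strategy for long test sentences (sketch): split on
    sentence-ending punctuation, parse each segment independently, then
    attach every segment root to one shared artificial root node."""
    # split the token sequence on sentence-ending punctuation
    segments, cur = [], []
    for tok in tokens:
        cur.append(tok)
        if is_end_punct(tok):
            segments.append(cur)
            cur = []
    if cur:
        segments.append(cur)

    # parse each segment and merge: re-index local heads into the full
    # sentence, pointing each segment root at the shared root (index 0)
    merged = [None]  # index 0 is the artificial root
    offset = 0
    for seg in segments:
        heads = parse(seg)  # heads[j-1] is the head of local word j
        for h in heads:
            merged.append(h + offset if h != 0 else 0)
        offset += len(seg)
    return merged
```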

Evaluation Measures Following standard practice, we use Unlabeled Attachment Score (UAS) as the evaluation metric in all our experiments. We report UAS excluding punctuation on CoNLL datasets, following Martins et al. (2013). For the CATiB dataset, we report UAS including punctuation in order to be consistent with the published results in the 2013 SPMRL shared task [28].

Baselines We compare our model with the Turbo parser and the MST parser. For the Turbo parser, we directly compare with the recently published results in [16]. For the MST parser, we train a second-order non-projective model using the most recent version of the code (http://sourceforge.net/projects/mstparser/).

We also compare our model against a discriminative reranker. The reranker operates over the top-50 list obtained from the MST parser. (The MST parser is trained in projective mode for reranking because generating a top-k list from a second-order non-projective model is intractable.) We use 10-fold cross-validation to generate candidate lists for training. We then train the reranker by running 10 epochs of cost-augmented MIRA. The reranker uses the same features as our model, along with the tree scores obtained from the MST parser (which is a standard practice in reranking).

             Our Model (UAS)          Turbo    MST 2nd-Ord.  Best Published   Top-50     Top-500
             Turbo Feat.  Full Feat.  (UAS)    (UAS)         UAS              Reranker   Reranker
Arabic       79.86        80.21       79.64    78.75         81.12 (Ma11)     79.03      78.91
Bulgarian    92.97        93.30       93.10    91.56         94.02 (Zh13)     92.81      -
Chinese      92.06        92.63       89.98    91.77         91.89 (Ma10)     92.25      -
Czech        90.62        91.04       90.32    87.30         90.32 (Ma13)     88.14      -
Danish       91.45        91.80       91.48    90.50         92.00 (Zh13)     90.88      90.91
Dutch        85.83        86.47       86.19    84.11         86.19 (Ma13)     81.01      -
English      92.79        92.94       93.22    91.54         93.22 (Ma13)     92.41      -
German       91.79        92.07       92.41    90.14         92.41 (Ma13)     91.19      -
Japanese     93.23        93.42       93.52    92.92         93.72 (Ma11)     93.40      -
Portuguese   91.82        92.41       92.69    91.08         93.03 (Ko10)     91.47      -
Slovene      86.19        86.82       86.01    83.25         86.95 (Ma11)     84.81      85.37
Spanish      88.24        88.21       85.59    84.33         87.96 (Zh13)     86.85      87.21
Swedish      90.48        90.71       91.14    89.05         91.62 (Zh13)     90.53      -
Turkish      76.82        77.21       76.90    74.39         77.55 (Ko10)     76.35      76.23
Average      88.87        89.23       88.72    86.86         89.33            87.92      -
Table 2: Results of our model, the Turbo parser, and the MST parser. "Best Published UAS" includes the most accurate parsers among Nivre et al. (2006), McDonald et al. (2006), Martins et al. (2010), Martins et al. (2011), Martins et al. (2013), Koo et al. (2010), Rush and Petrov (2012), Zhang and McDonald (2012) and Zhang et al. (2013). Martins et al. (2013) is the current Turbo parser. The last two columns show the UAS of the discriminative reranker.

Experimental Details Following Koo and Collins (2010), we always first train a first-order pruner. For each word x_i, we prune away incoming dependencies ⟨h_i, x_i⟩ whose probability is less than 0.005 times the probability of the most likely head, and limit the number of candidate heads to at most 30. This gives a 99% pruning recall on the CATiB development set. The first-order model is also trained using the algorithm in Figure 4. After pruning, we tune the regularization parameter C ∈ {0.1, 0.01, 0.001} on the development set for each language. Because the CoNLL datasets do not have a standard development set, we randomly select a held-out set of 200 sentences from the training set. We also pick the number of training epochs from {50, 100, 150} that gives the best performance on the development set for each language. After tuning, the model is trained on the full training set with the selected parameters.
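A sketch of this pruning rule follows, assuming probs holds the first-order head probabilities for one word as (head, probability) pairs; the representation is illustrative.

```python
def prune_heads(probs, ratio=0.005, max_heads=30):
    """First-order pruning: keep candidate heads whose probability is at
    least `ratio` times that of the most likely head, capped at `max_heads`.

    probs -- list of (head, probability) pairs for one word
    """
    best = max(p for _, p in probs)
    kept = [(h, p) for h, p in probs if p >= ratio * best]
    kept.sort(key=lambda hp: -hp[1])  # most likely heads first
    return [h for h, _ in kept[:max_heads]]
```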

We apply the Random Walk-based sampling method (see Section 3.2.2) for the standard dependency parsing task. However, for joint parsing and POS correction on the CATiB dataset we do not use the Random Walk method, because the first-order features of normal parsing are no longer first-order when POS tags are also variables. Therefore the first-order distribution is not well-defined, and we employ only Gibbs sampling for simplicity. On the CATiB dataset, we restrict the sampled trees to always be projective, as described in Section 3.2.1. We do not impose this constraint for the CoNLL datasets.

6 Results

Comparison with State-of-the-art Parsers Table 2 summarizes the performance of our model and of the baselines. We first compare our model to the Turbo parser using the Turbo parser feature set. This is meant to test how our learning and inference methods compare to a dual decomposition approach. The first column in Table 2 shows the result for our model with an average of 88.87%, and the third column shows the results for the Turbo parser with an average of 88.72%. This suggests that our learning and inference procedures are as effective as the dual decomposition method in the Turbo parser.

Next, we add global features that are not used by the Turbo parser. The performance of our model is shown in the second column with an average of 89.23%. It outperforms the Turbo parser by 0.5% and achieves the best reported performance on four languages. Moreover, our model also outperforms the 88.80% average UAS reported in Martins et al. (2011), which is the top performing single parsing system (to the best of our knowledge).

Comparison with Reranking As column 6 of Table 2 shows, our model outperforms the reranker by 1.3%. (Note that the comparison is conservative, because we can also add the MST scores as features in our model, as the reranker does. With these features our model achieves an average UAS of 89.28%.) One possible explanation of this performance gap between the reranker and our model is the small number of candidates considered by the reranker. To test this hypothesis, we performed experiments with a top-500 list for a subset of languages. (We ran this experiment on 5 languages with small datasets, due to the scalability issues associated with reranking a top-500 list.) As column 7 shows, this increase in the list size does not change the relative performance of the reranker and our model.

Joint Parsing and POS Correction Table 3 shows the results of joint parsing and POS correction on the CATiB dataset, for our model and state-of-the-art systems. As the upper part of the table shows, the parser with corrected tags reaches 88.38% compared to the accuracy of 88.46% on the gold tags. This is a substantial increase from the parser that uses predicted tags (86.95%).

To put these numbers into perspective, the bottom part of Table 3 shows the accuracy of the best systems from the 2013 SPMRL shared task on Arabic parsing using predicted information [28]. Our system not only outperforms the best single system [2] by 1.4%, but it also tops the ensemble system that combines three powerful parsers: the Mate parser [3], the Easy-First parser [10] and the Turbo parser [16].

Impact of Sampling Methods We compare the two sampling methods introduced in Section 3.2 with respect to their decoding efficiency. Specifically, we measure the score of the retrieved trees in testing as a function of the decoding speed, measured in the number of tokens per second. We vary the temperature update rate c in order to decode at different speeds. Figure 9 shows the corresponding curves for two languages: Arabic and Chinese. We select these two languages because they correspond to two extremes in sentence length: Arabic has the longest sentences on average, while Chinese has the shortest. For both languages, the tree score improves over time. Given sufficient time, both sampling methods achieve the same score. However, the Random Walk-based sampler performs better when quality is traded for speed. This result is to be expected, given that each iteration of this sampler makes multiple changes to the tree, in contrast to the single-edge change of the Gibbs sampler.

                 Dev. Set (≤70)           Testing Set
                 POS Acc.   UAS           POS Acc.   UAS
Gold             -          90.27         -          88.46
Predicted        96.87      88.81         96.82      86.95
POS Correction   97.72      90.08         97.49      88.38
CADIM            96.87      87.4          96.82      85.78
IMS-Single       -          -             -          86.96
IMS-Ensemble     -          -             -          88.32
Table 3: Results for parsing and corrective tagging on the CATiB dataset. The upper part shows the UAS of our model with gold/predicted information or with POS correction. The bottom part shows the UAS of the best systems in the SPMRL shared task. IMS-Single [2] is the best single parsing system, while IMS-Ensemble [2] is the best ensemble parsing system. We also show results for CADIM [20], the second-best system, because we use their predicted features.
Figure 9: Total score of the predicted test trees as a function of the decoding speed, measured in the number of tokens per second. (a) Arabic; (b) Chinese.

The Effect of Constraints in Learning Our training method updates parameters to satisfy the pairwise constraints between (1) subsequent samples on the sampling path and (2) selected samples and the ground truth. Figure 10 shows that applying both types of constraints is consistently better than using either of them alone. Moreover, these results demonstrate that comparison between subsequent samples is more important than comparison against the gold tree.

Figure 10: UAS on four languages when training with different constraints. “Neighbor” corresponds to pairwise constraints between subsequent samples, “Gold” represents constraints between a single sample and the ground truth, “Both” means applying both types of constraints.

Decoding Speed Our sampling-based parser is an anytime algorithm, so its running time can be traded for performance. Figure 9 illustrates this trade-off. In the experiments reported above, we chose a conservative cooling rate and continued to sample until the score no longer changed. The parser still processed all the datasets in reasonable time. For example, decoding all the test sentences took 3 minutes for Chinese and 15 minutes for Arabic. Our current implementation is in Java and can be further optimized for speed.

7 Conclusions

This paper demonstrates the power of combining a simple inference procedure with a highly expressive scoring function. Our model achieves the best results on the standard dependency parsing benchmark, outperforming parsing methods with elaborate inference procedures. In addition, this framework provides simple and effective means for joint parsing and corrective tagging.

Acknowledgments

This research is developed in collaboration with the Arabic Language Technologies (ALT) group at Qatar Computing Research Institute (QCRI) within the Iyas project. The authors acknowledge the support of the MURI program (W911NF-10-1-0533), the DARPA BOLT program, and the US-Israel Binational Science Foundation (BSF, Grant No. 2012330). We thank the MIT NLP group and the ACL reviewers for their comments.

References

  • [1] C. Andrieu, N. De Freitas, A. Doucet and M. I. Jordan (2003) An introduction to MCMC for machine learning. Machine Learning 50 (1-2), pp. 5–43.
  • [2] A. Björkelund, O. Cetinoglu, R. Farkas, T. Mueller and W. Seeker (2013) (Re)ranking meets morphosyntax: state-of-the-art results from the SPMRL 2013 shared task. Seattle, Washington, USA, pp. 135–145.
  • [3] B. Bohnet (2010) Top accuracy and fast dependency parsing is not a contradiction. pp. 89–97.
  • [4] S. Buchholz and E. Marsi (2006) CoNLL-X shared task on multilingual dependency parsing. pp. 149–164.
  • [5] E. Charniak and M. Johnson (2005) Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. pp. 173–180.
  • [6] M. Collins (2000) Discriminative reranking for natural language parsing. ICML '00, pp. 175–182.
  • [7] K. Crammer and Y. Singer (2003) Ultraconservative online algorithms for multiclass problems. The Journal of Machine Learning Research 3, pp. 951–991.
  • [8] H. Daumé III, J. Langford and D. Marcu (2009) Search-based structured prediction. Machine Learning 75 (3), pp. 297–325.
  • [9] S. Geman and D. Geman (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), pp. 721–741.
  • [10] Y. Goldberg and M. Elhadad (2010) An efficient algorithm for easy-first non-directional dependency parsing. pp. 742–750.
  • [11] B. Haddow, A. Arun and P. Koehn (2011) SampleRank training for phrase-based machine translation. pp. 261–271.
  • [12] L. Huang (2008) Forest reranking: discriminative parsing with non-local features. pp. 586–594.
  • [13] T. Koo and M. Collins (2010) Efficient third-order dependency parsers. pp. 1–11.
  • [14] T. Koo, A. M. Rush, M. Collins, T. Jaakkola and D. Sontag (2010) Dual decomposition for parsing with non-projective head automata. pp. 1288–1298.
  • [15] Q. Li, J. Wang, Z. Tu and D. P. Wipf (2013) Fixed-point model for structured labeling. pp. 214–221.
  • [16] A. F. Martins, M. B. Almeida and N. A. Smith (2013) Turning on the turbo: fast third-order non-projective turbo parsers.
  • [17] A. F. Martins, N. A. Smith, P. M. Aguiar and M. A. Figueiredo (2011) Dual decomposition with many overlapping components. pp. 238–249.
  • [18] A. F. Martins, N. A. Smith, E. P. Xing, P. M. Aguiar and M. A. Figueiredo (2010) Turbo parsers: dependency parsing by approximate variational inference. pp. 34–44.
  • [19] A. F. Martins, N. A. Smith and E. P. Xing (2009) Concise integer linear programming formulations for dependency parsing. pp. 342–350.
  • [20] Y. Marton, N. Habash, O. Rambow and S. Alkhulani (2013) SPMRL'13 shared task system: the CADIM Arabic dependency parser. pp. 76–80.
  • [21] R. McDonald, F. Pereira, K. Ribarov and J. Hajic (2005) Non-projective dependency parsing using spanning tree algorithms. pp. 523–530.
  • [22] R. McDonald, K. Lerman and F. Pereira (2006) Multilingual dependency analysis with a two-stage discriminative parser. pp. 216–220.
  • [23] R. T. McDonald and F. C. Pereira (2006) Online learning of approximate dependency parsing algorithms.
  • [24] T. Nakagawa (2007) Multilingual dependency parsing using global features. pp. 952–956.
  • [25] J. Nivre, J. Hall, J. Nilsson, G. Eryiǧit and S. Marinov (2006) Labeled pseudo-projective dependency parsing with support vector machines. pp. 221–225.
  • [26] V. Punyakanok, D. Roth, W. Yih and D. Zimak (2004) Semantic role labeling via integer linear programming inference. pp. 1346.
  • [27] A. M. Rush and S. Petrov (2012) Vine pruning for efficient multi-pass dependency parsing. pp. 498–507.
  • [28] D. Seddah, R. Tsarfaty, S. Kübler, M. Candito, J. D. Choi, R. Farkas, J. Foster, I. Goenaga, K. G. Galletebeitia and Y. Goldberg (2013) Overview of the SPMRL 2013 shared task: a cross-framework evaluation of parsing morphologically rich languages. pp. 146–182.
  • [29] D. Sontag, A. Globerson and T. Jaakkola (2011) Introduction to dual decomposition for inference. Optimization for Machine Learning, pp. 219–254.
  • [30] M. Surdeanu, R. Johansson, A. Meyers, L. Màrquez and J. Nivre (2008) The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. pp. 159–177.
  • [31] K. Toutanova, D. Klein, C. D. Manning and Y. Singer (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. pp. 173–180.
  • [32] M. L. Wick, K. Rohanimanesh, K. Bellare, A. Culotta and A. McCallum (2011) SampleRank: training factor graphs with atomic gradients. pp. 777–784.
  • [33] D. B. Wilson (1996) Generating random spanning trees more quickly than the cover time. pp. 296–303.
  • [34] H. Zhang and R. McDonald (2012) Generalized higher-order dependency parsing with cube pruning. pp. 320–331.
  • [35] H. Zhang, L. H. K. Zhao and R. McDonald (2013) Online learning for inexact hypergraph search.
  • [36] Y. Zhang and J. Nivre (2011) Transition-based dependency parsing with rich non-local features. pp. 188–193.