Accurate scoring of syntactic structures such as head-modifier arcs in dependency parsing typically requires rich, high-dimensional feature representations. A small subset of such features is often selected manually. This is problematic when features lack clear linguistic meaning, as with embeddings, or when the relevant information is blended across features. In this paper, we use tensors to map high-dimensional feature vectors into low-dimensional representations. We explicitly maintain the parameters as a low-rank tensor to obtain low-dimensional representations of words in their syntactic roles, and to leverage modularity in the tensor for easy training with online algorithms. Our parser consistently outperforms the Turbo and MST parsers across 14 different languages. We also obtain the best published UAS results on 5 languages. Our code is available at https://github.com/taolei87/RBGParser.
Finding an expressive representation of input sentences is crucial for accurate parsing. Syntactic relations manifest themselves in a broad range of surface indicators, ranging from morphological to lexical, including positional and part-of-speech (POS) tagging features. Traditionally, parsing research has focused on modeling the direct connection between the features and the predicted syntactic relations such as head-modifier (arc) relations in dependency parsing. Even in the case of first-order parsers, this results in a high-dimensional vector representation of each arc. Discrete features, and their cross products, can be further complemented with auxiliary information about words participating in an arc, such as continuous vector representations of words. The exploding dimensionality of rich feature vectors must then be balanced with the difficulty of effectively learning the associated parameters from limited training data.
A predominant way to counter the high dimensionality of features is to manually design or select a meaningful set of feature templates, which are used to generate different types of features [27, 16, 22]. Direct manual selection may be problematic for two reasons. First, features may lack clear linguistic interpretation as in distributional features or continuous vector embeddings of words. Second, designing a small subset of templates (and features) is challenging when the relevant linguistic information is distributed across the features. For instance, morphological properties are closely tied to part-of-speech tags, which in turn relate to positional features. These features are not redundant. Therefore, we may suffer a performance loss if we select only a small subset of the features. On the other hand, by including all the rich features, we face over-fitting problems.
We depart from this view and leverage high-dimensional feature vectors by mapping them into low dimensional representations. We begin by representing high-dimensional feature vectors as multi-way cross-products of smaller feature vectors that represent words and their syntactic relations (arcs). The associated parameters are viewed as a tensor (multi-way array) of low rank, and optimized for parsing performance. By explicitly representing the tensor in a low-rank form, we have direct control over the effective dimensionality of the set of parameters. We obtain role-dependent low-dimensional representations for words (head, modifier) that are specifically tailored for parsing accuracy, and use standard online algorithms for optimizing the low-rank tensor components.
The overall approach has clear linguistic and computational advantages:
- Our low-dimensional embeddings are tailored to the syntactic context of words (head, modifier). This low-dimensional syntactic abstraction can be thought of as a proxy to manually constructed POS tags.
- By automatically selecting a small number of dimensions useful for parsing, we can leverage a wide array of (correlated) features. Unlike parsers such as MST, we can easily benefit from auxiliary information (e.g., word vectors) appended as features.
We implement the low-rank factorization model in the context of first- and third-order dependency parsing. The model was evaluated on 14 languages, using dependency data from CoNLL 2008 and CoNLL 2006. We compare our results against the MST [27] and Turbo [22] parsers. The low-rank parser achieves average performance of 89.08% across 14 languages, compared to 88.73% for the Turbo parser, and 87.19% for MST. The power of the low-rank model becomes evident in the absence of any part-of-speech tags. For instance, on the English dataset, the low-rank model trained without POS tags achieves 90.49% on first-order parsing, while the baseline gets 86.70% if trained under the same conditions, and 90.58% if trained with 12 core POS tags. Finally, we demonstrate that the model can successfully leverage word vector representations, in contrast to the baselines.
A great deal of parsing research has been dedicated to feature engineering [18, 25, 26]. While in most state-of-the-art parsers, features are selected manually [27, 29, 16, 22, 44, 35], automatic feature selection methods are gaining popularity [23, 1, 31, 2]. Following standard machine learning practices, these algorithms iteratively select a subset of features by optimizing parsing performance on a development set. These feature selection methods are particularly promising in parsing scenarios where the optimal feature set is likely to be a small subset of the original set of candidate features. Our technique, in contrast, is suitable for cases where the relevant information is distributed across a larger set of related features.
A lot of recent work has been done on mapping words into vector spaces [8, 41, 11, 30]. Traditionally, these vector representations have been derived primarily from co-occurrences of words within sentences, ignoring syntactic roles of the co-occurring words. Nevertheless, any such word-level representation can be used to offset inherent sparsity problems associated with full lexicalization [5]. In this sense they perform a role similar to POS tags.
Word-level vector space embeddings have so far had limited impact on parsing performance. From a computational perspective, adding non-sparse vectors directly as features, including their combinations, can significantly increase the number of active features for scoring syntactic structures (e.g., dependency arcs). Because of this issue, Cirik and Şensoy (2013) used word vectors only as unigram features (without combinations) as part of a shift-reduce parser [32]. The improvement in overall parsing performance was marginal. Another application of word vectors is compositional vector grammar [36]. While this method learns to map word combinations into vectors, it builds on existing word-level vector representations. In contrast, we represent words as vectors in a manner that is directly optimized for parsing. This framework enables us to learn new syntactically guided embeddings while also leveraging separately estimated word vectors as starting features, leading to improved parsing performance.
Many machine learning problems can be cast as matrix problems where the matrix represents a set of co-varying parameters. Such problems include, for example, multi-task learning and collaborative filtering. Rather than assuming that each parameter can be set independently of others, it is helpful to assume that the parameters vary in a low-dimensional subspace that has to be estimated together with the parameters. In terms of the parameter matrix, this corresponds to a low-rank assumption. Low-rank constraints are commonly used for improving generalization [19, 37, 38, 12].
A strict low-rank assumption can be restrictive. Indeed, recent approaches to matrix problems decompose the parameter matrix as a sum of low-rank and sparse matrices [40, 47]. The sparse matrix is used to highlight a small number of parameters that should vary independently even if most of them lie on a low-dimensional subspace [42, 4]. We follow this decomposition while extending the parameter matrix into a tensor.
Tensors are multi-way generalizations of matrices and possess an analogous notion of rank. Tensors are increasingly used as tools in spectral estimation [15], including in parsing [6] and other NLP problems [10], where the goal is to avoid local optima in maximum likelihood estimation. In contrast, we expand features for parsing into a multi-way tensor, and operate with an explicit low-rank representation of the associated parameter tensor. The explicit representation sidesteps inherent complexity problems associated with the tensor rank [14]. Our parameters are divided into a sparse set corresponding to manually chosen MST or Turbo parser features and a larger set governed by a low-rank tensor.
We begin by casting first-order dependency parsing as a tensor estimation problem. We first introduce the notation used in the paper, and then give a more formal description of our dependency parsing task.
Let $A \in \mathbb{R}^{n \times n \times d}$ be a 3-dimensional tensor (a 3-way array). We denote each element of the tensor as $A_{i,j,k}$ where $i \in [n]$, $j \in [n]$, $k \in [d]$, and $[n]$ is a shorthand for the set of integers $\{1, 2, \ldots, n\}$. Similarly, we use $M_{i,j}$ and $v_i$ to represent the elements of a matrix $M$ and a vector $v$, respectively.
We define the inner product of two tensors (or matrices) as $\langle A, B \rangle = \mathrm{vec}(A) \cdot \mathrm{vec}(B)$, where $\mathrm{vec}(\cdot)$ concatenates the tensor (or matrix) elements into a column vector. The squared norm of a tensor/matrix is denoted by $\|A\|^2 = \langle A, A \rangle$.
The Kronecker product of three vectors $u \in \mathbb{R}^n$, $v \in \mathbb{R}^n$, and $w \in \mathbb{R}^d$ is denoted by $u \otimes v \otimes w$ and forms a rank-1 tensor such that $(u \otimes v \otimes w)_{i,j,k} = u_i\, v_j\, w_k$.
Note that the vectors $u$, $v$, and $w$ may be column or row vectors. Their orientation is defined based on usage. For example, $u \otimes v = u v^{\top}$ is a rank-1 matrix when $u$ and $v$ are column vectors ($u^{\top} v$ if they are row vectors).
We say that tensor $A$ is in Kruskal form if

$$A = \sum_{i=1}^{r} U(i,:) \otimes V(i,:) \otimes W(i,:) \qquad (1)$$

where $U, V \in \mathbb{R}^{r \times n}$, $W \in \mathbb{R}^{r \times d}$, and $U(i,:)$ is the $i$-th row of matrix $U$. We will directly learn a low-rank tensor $A$ (because $r$ is small) in this form as one of our model parameters.
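As a quick concreteness check (not part of the parser itself), the following Python sketch builds a toy tensor in Kruskal form from random matrices $U$, $V$, $W$ and verifies that its inner product with a rank-1 tensor reduces to a sum over the $r$ components; this identity is what makes the low-rank score introduced below cheap to evaluate. All dimensions and values here are made up.

```python
import numpy as np

n, d, r = 6, 4, 3                                   # toy dimensions (U, V are r x n; W is r x d)
rng = np.random.default_rng(0)
U, V, W = rng.normal(size=(r, n)), rng.normal(size=(r, n)), rng.normal(size=(r, d))

# Kruskal form (Eq. 1): A = sum_i U(i,:) (x) V(i,:) (x) W(i,:)
A = sum(np.einsum('i,j,k->ijk', U[i], V[i], W[i]) for i in range(r))

# a rank-1 tensor u (x) v (x) w built from arbitrary vectors
u, v, w = rng.normal(size=n), rng.normal(size=n), rng.normal(size=d)
rank1 = np.einsum('i,j,k->ijk', u, v, w)

dense_score = np.sum(A * rank1)                     # <A, u (x) v (x) w> via the full tensor
lowrank_score = np.sum((U @ u) * (V @ v) * (W @ w)) # sum_i [Uu]_i [Vv]_i [Ww]_i
assert np.allclose(dense_score, lowrank_score)
```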
Let $x$ be a sentence and $\mathcal{Y}(x)$ the set of possible dependency trees over the words in $x$. We assume that the score $S(x, y)$ of each candidate dependency tree $y \in \mathcal{Y}(x)$ decomposes into a sum of "local" scores for arcs. Specifically:

$$S(x, y) = \sum_{h \to m \,\in\, y} s(h \to m)$$
where $h \to m$ is a head-modifier dependency arc in the tree $y$. Each $y$ is understood as a collection of arcs, where $h$ and $m$ index words in $x$ (note that in the case of high-order parsing, the sum may also include local scores for other syntactic structures, such as a grandparent-head-modifier score; see [22] for a complete list of these structures). For example, $x_h$ is the word corresponding to $h$. We suppress the dependence on $x$ whenever it is clear from context. For example, $s(h \to m)$ can depend on $x$ in complicated ways, as discussed below. The predicted parse is obtained as $\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} S(x, y)$.
A key problem is how we parameterize the arc scores $s(h \to m)$. Following the MST parser [27], we can define rich features characterizing each head-modifier arc, compiled into a sparse binary vector $\phi_{h \to m} \in \mathbb{R}^L$ that depends on the sentence $x$ as well as the chosen arc $h \to m$ (again, we suppress the dependence on $x$). Based on this feature representation, we define the score of each arc as $s_\theta(h \to m) = \theta \cdot \phi_{h \to m}$, where $\theta \in \mathbb{R}^L$ denotes the adjustable parameters to be learned, and $L$ is the number of parameters (and of possible features in $\phi_{h \to m}$).
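Since $\phi_{h \to m}$ is sparse and binary, it is typically stored as a list of active feature indices, so the arc score is just a sum of the corresponding entries of $\theta$, and the tree score is the sum of its arc scores. The sketch below illustrates this with hypothetical feature indices and a random weight vector; it is not the actual MST feature set.

```python
import numpy as np

L = 1000                                         # toy number of possible arc features
theta = np.random.default_rng(1).normal(size=L)  # learned weights (random stand-in)

def arc_score(active_features):
    """s_theta(h -> m) = theta . phi_{h->m}, with phi stored as a list of active indices."""
    return theta[active_features].sum()

def tree_score(tree):
    """S(x, y): sum of local arc scores; each arc carries its own active features."""
    return sum(arc_score(feats) for (_head, _mod, feats) in tree)

# a tiny hypothetical tree: (head index, modifier index, active feature indices)
toy_tree = [(0, 1, [3, 57, 204]), (1, 2, [3, 999])]
print(tree_score(toy_tree))
```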
Table 1: Feature templates for the head and modifier vectors $\phi_h$ and $\phi_m$.

| Unigram features | | |
|---|---|---|
| form | form-p | form-n |
| lemma | lemma-p | lemma-n |
| pos | pos-p | pos-n |
| morph | bias | |
| Bigram features | | |
| pos-p, pos | | |
| pos, pos-n | | |
| pos, lemma | | |
| morph, lemma | | |
| Trigram features | | |
| pos-p, pos, pos-n | | |
We can alternatively specify arc features in terms of rank-1 tensors by taking the Kronecker product of simpler feature vectors associated with the head (vector $\phi_h \in \mathbb{R}^n$) and the modifier (vector $\phi_m \in \mathbb{R}^n$), as well as the arc itself (vector $\phi_{h,m} \in \mathbb{R}^d$). Here $\phi_{h,m}$ is much lower-dimensional than the MST arc feature vector $\phi_{h \to m}$ discussed earlier. For example, $\phi_{h,m}$ may be composed only of indicators for binned arc lengths (in our current version, $\phi_{h,m}$ contains only the binned arc length; other possible features include, for example, the label of the arc $h \to m$, the POS tags between the head and the modifier, and boolean flags indicating the occurrence of in-between punctuation or conjunctions). $\phi_h$ and $\phi_m$, on the other hand, are built from the features shown in Table 1. By taking the cross-product of all these component feature vectors, we obtain the full feature representation of the arc $h \to m$ as a rank-1 tensor

$$\phi_h \otimes \phi_m \otimes \phi_{h,m}$$
Note that the elements of this rank-1 tensor include feature combinations that are not part of the feature crossings in $\phi_{h \to m}$. In this sense, the rank-1 tensor represents a substantial feature expansion. The arc score associated with the tensor representation is defined analogously as

$$s_A(h \to m) = \langle A,\ \phi_h \otimes \phi_m \otimes \phi_{h,m} \rangle$$
where the adjustable parameters $A \in \mathbb{R}^{n \times n \times d}$ also form a tensor. Given the typical dimensions of the component feature vectors $\phi_h$, $\phi_m$, and $\phi_{h,m}$, it is not even possible to store all the parameters in $A$ explicitly. Indeed, in the full English training set of CoNLL-2008, the tensor $A$ would contain orders of magnitude more entries than the number of features in the MST feature vector. To counter this feature explosion, we restrict the parameters $A$ to have low rank.
We can represent a rank-$r$ tensor $A$ explicitly in terms of parameter matrices $U$, $V$, and $W$ as shown in Eq. 1. As a result, the arc score for the tensor reduces to evaluating $U\phi_h$, $V\phi_m$, and $W\phi_{h,m}$, which are all $r$-dimensional vectors and can be computed efficiently from the sparse vectors $\phi_h$, $\phi_m$, and $\phi_{h,m}$. The resulting arc score is then
$$s_A(h \to m) = \sum_{i=1}^{r} [U\phi_h]_i\, [V\phi_m]_i\, [W\phi_{h,m}]_i \qquad (2)$$
By learning parameters $U$, $V$, and $W$ that function well in dependency parsing, we also learn context-dependent embeddings for words and arcs. Specifically, $U\phi_h$ (for a given sentence $x$, suppressed) is an $r$-dimensional vector representation of the word corresponding to $h$ in its role as a head word. Similarly, $V\phi_m$ provides an analogous representation of a modifier $m$. Finally, $W\phi_{h,m}$ is a vector embedding of the supplemental arc-dependent information. The resulting embeddings are therefore tied to the syntactic roles of the words (and arcs), and learned in order to perform well in parsing.
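The sketch below (toy dimensions, random parameters, hypothetical feature indices) shows how the score in Eq. 2 is evaluated in practice: each sparse component vector is projected once through $U$, $V$, or $W$, which also yields the $r$-dimensional role-specific embeddings described above.

```python
import numpy as np

n, d, r = 8, 3, 4                               # toy feature and rank dimensions
rng = np.random.default_rng(2)
U, V, W = rng.normal(size=(r, n)), rng.normal(size=(r, n)), rng.normal(size=(r, d))

def project(M, active_ids):
    """M @ phi for a sparse binary phi given by its list of active indices."""
    return M[:, active_ids].sum(axis=1)

# hypothetical active features for one arc h -> m
head_feats, mod_feats, arc_feats = [0, 5], [2, 7], [1]

head_emb = project(U, head_feats)   # U phi_h : embedding of the word in its head role
mod_emb = project(V, mod_feats)     # V phi_m : embedding of the word in its modifier role
arc_emb = project(W, arc_feats)     # W phi_{h,m}: embedding of the arc information

s_tensor = float(np.sum(head_emb * mod_emb * arc_emb))   # Eq. 2
print(s_tensor)
```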
We expect a dependency parsing model to benefit from several aspects of the low-rank tensor scoring. For example, we can easily incorporate additional useful features in the feature vectors $\phi_h$, $\phi_m$, and $\phi_{h,m}$, since the low-rank assumption (for small enough $r$) effectively counters the otherwise uncontrolled feature expansion. Moreover, by controlling the amount of information we can extract from each of the component feature vectors (via the rank $r$), the statistical estimation problem does not scale dramatically with the dimensions of $\phi_h$, $\phi_m$, and $\phi_{h,m}$. In particular, the low-rank constraint can help generalize to unseen arcs. Consider a product feature $[\phi_h]_i [\phi_m]_j [\phi_{h,m}]_k$ which is non-zero only for an arc of a particular binned length between a particular head and modifier word. If such an arc has not been seen in the available training data, it does not contribute to the traditional arc score $s_\theta(h \to m)$. In contrast, with the low-rank constraint, the arc score in Eq. 2 would typically be non-zero.
Our parsing model aims to combine the strengths of the traditional MST/Turbo parser features with the new low-rank tensor features. In this way, the model can capture a wide range of information, including auxiliary features, without an uncontrolled feature explosion, while retaining full access to the manually engineered features that have proven useful. Specifically, we define the arc score as the combination
$$s_\gamma(h \to m) = \gamma\, s_\theta(h \to m) + (1 - \gamma)\, s_A(h \to m) \qquad (3)$$
where $\theta \in \mathbb{R}^L$, $U, V \in \mathbb{R}^{r \times n}$, and $W \in \mathbb{R}^{r \times d}$ are the model parameters to be learned. The rank $r$ and the weight $\gamma \in [0, 1]$ (balancing the two scores) are hyper-parameters of our model.
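The blend itself is a single line; the small helper below assumes the two component scores have already been computed as in the previous sketches (the argument values are placeholders).

```python
def combined_arc_score(s_theta, s_tensor, gamma):
    """Eq. 3: gamma weights the traditional MST-style score, (1 - gamma) the tensor score."""
    return gamma * s_theta + (1.0 - gamma) * s_tensor

# e.g. gamma = 0 scores arcs with the low-rank tensor component alone
print(combined_arc_score(s_theta=1.2, s_tensor=0.4, gamma=0.3))
```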
The training set consists of pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where each pair consists of a sentence $x_i$ and the corresponding gold (target) parse $y_i$. The goal is to learn values of the parameters $\theta$, $U$, $V$, and $W$ that optimize the combined scoring function $S_\gamma(x, y) = \sum_{h \to m \in y} s_\gamma(h \to m)$, with $s_\gamma$ defined in Eq. 3, for parsing performance. We adopt a maximum soft-margin framework for this learning problem. Specifically, we find parameters $\theta$, $U$, $V$, $W$, and slack variables $\{\xi_i\}$ that minimize
$$\sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad S_\gamma(x_i, y_i) \ge S_\gamma(x_i, \tilde{y}) + \|\tilde{y} - y_i\|_1 - \xi_i, \quad \xi_i \ge 0, \quad \forall\, \tilde{y} \in \mathcal{Y}(x_i),\ \forall i \qquad (4)$$
where $\|\tilde{y} - y_i\|_1$ is the number of mismatched arcs between the two trees, and $\xi_i$ is a non-negative slack variable. The constraints serve to separate the gold tree from the other alternatives in $\mathcal{Y}(x_i)$ with a margin that increases with this distance.
The objective as stated is not jointly convex with respect to $U$, $V$, and $W$ due to our explicit representation of the low-rank tensor. However, if we fix any two of these parameter sets, for example $V$ and $W$, then the combined score $S_\gamma(x, y)$ is a linear function of both $\theta$ and $U$. As a result, the objective is jointly convex with respect to $\theta$ and $U$ and could be optimized using standard tools. To accelerate learning, however, we adopt an online learning setup. Specifically, we use the passive-aggressive learning algorithm [9] tailored to our setting, updating the parameter pairs $(\theta, U)$, $(\theta, V)$, and $(\theta, W)$ in an alternating manner. This method is described below.
In an online learning setup, we update the parameters successively based on each sentence. In order to apply the passive-aggressive algorithm, we fix two of $U$, $V$, and $W$ (say, for example, $V$ and $W$) in an alternating manner, and apply a closed-form update to the remaining parameters (here $U$ and $\theta$). This is possible because the objective function with respect to $(\theta, U)$ has the same form as in the original passive-aggressive algorithm. To illustrate this, consider a training sentence $x_i$. The update first finds the best competing tree,
$$\tilde{y}_i = \arg\max_{y \in \mathcal{Y}(x_i)} \; S_\gamma(x_i, y) + \|y - y_i\|_1 \qquad (5)$$
which is the tree that most violates the constraints in Eq. 4 (i.e., maximizes the loss $\xi_i$). We then obtain the parameter increments $\Delta\theta$ and $\Delta U$ by solving

$$\min_{\Delta\theta,\, \Delta U,\, \xi \ge 0} \;\; \frac{1}{2}\|\Delta\theta\|^2 + \frac{1}{2}\|\Delta U\|^2 + C\,\xi \qquad \text{s.t.} \quad S_\gamma(x_i, y_i) - S_\gamma(x_i, \tilde{y}_i) \ge \|\tilde{y}_i - y_i\|_1 - \xi$$

where the scores on the left-hand side are computed with the updated parameters $\theta + \Delta\theta$ and $U + \Delta U$.
In this way, the optimization problem attempts to keep the parameter change as small as possible, while forcing it to achieve mostly zero loss on this single training instance. This problem has a closed-form solution

$$\Delta\theta = \min\!\left\{C, \; \frac{\mathrm{loss}}{\|d\theta\|^2 + \|dU\|^2}\right\} d\theta, \qquad \Delta U = \min\!\left\{C, \; \frac{\mathrm{loss}}{\|d\theta\|^2 + \|dU\|^2}\right\} dU$$

where

$$\begin{aligned}
\mathrm{loss} &= S_\gamma(x_i, \tilde{y}_i) + \|\tilde{y}_i - y_i\|_1 - S_\gamma(x_i, y_i) \\
d\theta &= \gamma \Big( \sum_{h \to m \,\in\, y_i} \phi_{h \to m} \;-\; \sum_{h \to m \,\in\, \tilde{y}_i} \phi_{h \to m} \Big) \\
dU &= (1-\gamma) \Big( \sum_{h \to m \,\in\, y_i} \big( V\phi_m \odot W\phi_{h,m} \big)\, \phi_h^{\top} \;-\; \sum_{h \to m \,\in\, \tilde{y}_i} \big( V\phi_m \odot W\phi_{h,m} \big)\, \phi_h^{\top} \Big)
\end{aligned}$$
Here $\odot$ denotes the Hadamard (element-wise) product. The magnitude of the change of $U$ and $\theta$ is controlled by the parameter $C$; by varying $C$, we can determine an appropriate step size for the online updates. The updates also illustrate how $\gamma$ balances the effect of the MST component of the score relative to the low-rank tensor score. When $\gamma = 0$, the arc scores are based entirely on the low-rank tensor and $\theta$ is never updated ($d\theta = 0$). Note that $\phi_h$, $\phi_m$, $\phi_{h,m}$, and $\phi_{h \to m}$ are typically very sparse for each word or arc. Therefore $d\theta$ and $dU$ are also sparse and can be computed efficiently.
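The following sketch implements one such alternating passive-aggressive step for the pair $(\theta, U)$ with $V$ and $W$ held fixed, following the update reconstructed above. The arc representation, the caller-supplied loss, and all dimensions are placeholders; this is a minimal illustration rather than the authors' implementation.

```python
import numpy as np

def pa_step(theta, U, V, W, gold_arcs, pred_arcs, loss, gamma, C):
    """One passive-aggressive update of (theta, U), with V and W fixed.

    Each arc is a dict holding its sparse MST feature indices 'phi_arc' (assumed
    distinct) and dense toy vectors 'phi_h', 'phi_m', 'phi_hm'.  `loss` is the
    margin violation of the best competing tree, computed by the caller.
    """
    d_theta, d_U = np.zeros_like(theta), np.zeros_like(U)
    for arcs, sign in ((gold_arcs, +1.0), (pred_arcs, -1.0)):
        for a in arcs:
            d_theta[a['phi_arc']] += sign * gamma
            # gradient of Eq. 2 w.r.t. U: (V phi_m elementwise-times W phi_hm) phi_h^T
            d_U += sign * (1.0 - gamma) * np.outer(
                (V @ a['phi_m']) * (W @ a['phi_hm']), a['phi_h'])
    denom = np.sum(d_theta ** 2) + np.sum(d_U ** 2)
    step = 0.0 if loss <= 0.0 or denom == 0.0 else min(C, loss / denom)
    return theta + step * d_theta, U + step * d_U

# toy usage with random parameters and a single gold arc
rng = np.random.default_rng(3)
n, d, r, L = 5, 2, 3, 10
theta, U = rng.normal(size=L), rng.normal(size=(r, n))
V, W = rng.normal(size=(r, n)), rng.normal(size=(r, d))
arc = {'phi_arc': [1, 4], 'phi_h': rng.normal(size=n),
       'phi_m': rng.normal(size=n), 'phi_hm': rng.normal(size=d)}
theta, U = pa_step(theta, U, V, W, [arc], [], loss=1.0, gamma=0.3, C=0.01)
```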
The alternating online algorithm relies on how we initialize $U$, $V$, and $W$, since each update is carried out in the context of the other two. A random initialization of these parameters is unlikely to work well, both because of the dimensions involved and because of the nature of the alternating updates. We instead use a reasonable deterministic "guess" as the initialization.
We begin by training our model without any low-rank parameters and obtain the parameters $\theta$. The majority of features in this MST component can be expressed as elements of the feature tensor, i.e., as products $[\phi_h]_i [\phi_m]_j [\phi_{h,m}]_k$. We can therefore create a tensor representation $B$ of $\theta$ such that $B_{i,j,k}$ equals the corresponding parameter value in $\theta$. We use a low-rank approximation of $B$ as the initialization. Specifically, we unfold the tensor $B$ into a matrix of dimensions $n$ and $n \cdot d$, where $n$ is the dimension of $\phi_h$ and $\phi_m$ and $d$ is the dimension of $\phi_{h,m}$. For instance, a rank-1 tensor $u \otimes v \otimes w$ can be unfolded as the matrix $u\,(v \otimes w)^{\top}$. We compute the top-$r$ SVD of the resulting unfolded matrix, $P \Sigma Q^{\top}$. $U$ is initialized from the top-$r$ left singular vectors, scaled by the corresponding singular values. Each right singular vector $Q(i,:)$ can in turn be viewed as a matrix in $\mathbb{R}^{n \times d}$; the leading left and right singular vectors of this matrix are assigned to $V(i,:)$ and $W(i,:)$, respectively. In our implementation, we run one epoch of our model without the low-rank parameters and use the resulting $\theta$ to initialize the tensor components in this way.
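A sketch of this initialization with toy dimensions is given below. Exactly how the singular values are distributed across $U$, $V$, and $W$ is a design choice of this sketch, not a detail taken from the text.

```python
import numpy as np

n, d, r = 6, 4, 3
rng = np.random.default_rng(4)
B = rng.normal(size=(n, n, d))       # stands in for the tensor filled from the MST weights

# unfold along the head mode: rows indexed by head features, columns by (modifier, arc)
B_unf = B.reshape(n, n * d)
P, S, Qt = np.linalg.svd(B_unf, full_matrices=False)

U = S[:r, None] * P[:, :r].T         # top-r left singular vectors, scaled by singular values
V, W = np.zeros((r, n)), np.zeros((r, d))
for i in range(r):
    Qi = Qt[i].reshape(n, d)         # each right singular vector, viewed as an n x d matrix
    p, s, qt = np.linalg.svd(Qi, full_matrices=False)
    V[i], W[i] = p[:, 0], s[0] * qt[0]   # its leading left/right singular vectors

# sanity check: the Kruskal-form reconstruction approximates the original tensor
A = sum(np.einsum('i,j,k->ijk', U[i], V[i], W[i]) for i in range(r))
print(np.linalg.norm(A - B) / np.linalg.norm(B))
```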
The passive-aggressive algorithm regularizes the increments (e.g., $\Delta U$ and $\Delta\theta$) during each update, but does not include any overall regularization. In other words, continuing to update the model may lead to large parameter values and over-fitting. To counter this effect, we use parameter averaging, as in the MST and Turbo parsers: the final parameters are those averaged across all iterations (cf. [7]). For simplicity, we average $\theta$, $U$, $V$, and $W$ separately, which works well empirically.
Table 2: Unlabeled attachment scores (UAS, excluding punctuation) of our model and the baselines on the 14 CoNLL datasets. The first four columns are first-order parsers; the remaining columns are high-order parsers.

| | Ours | NT-1st | MST | Turbo | Ours-3rd | NT-3rd | MST-2nd | Turbo-3rd | Best Published |
|---|---|---|---|---|---|---|---|---|---|
Arabic | 79.60 | 78.71 | 78.3 | 77.23 | 79.95 | 79.53 | 78.75 | 79.64 | 81.12 (Ma11) |
Bulgarian | 92.30 | 91.14 | 90.98 | 91.76 | 93.50 | 92.79 | 91.56 | 93.1 | 94.02 (Zh13) |
Chinese | 91.43 | 90.85 | 90.40 | 88.49 | 92.68 | 92.39 | 91.77 | 89.98 | 91.89 (Ma10) |
Czech | 87.90 | 86.62 | 86.18 | 87.66 | 90.50 | 89.43 | 87.3 | 90.32 | 90.32 (Ma13) |
Danish | 90.64 | 89.80 | 89.84 | 89.42 | 91.39 | 90.82 | 90.5 | 91.48 | 92.00 (Zh13) |
Dutch | 84.81 | 83.77 | 82.89 | 83.61 | 86.41 | 86.08 | 84.11 | 86.19 | 86.19 (Ma13) |
English | 91.84 | 91.40 | 90.59 | 91.21 | 93.02 | 92.82 | 91.54 | 93.22 | 93.22 (Ma13) |
German | 90.24 | 89.70 | 89.54 | 90.52 | 91.97 | 92.26 | 90.14 | 92.41 | 92.41 (Ma13) |
Japanese | 93.74 | 93.36 | 93.38 | 92.78 | 93.71 | 93.23 | 92.92 | 93.52 | 93.72 (Ma11) |
Portuguese | 90.94 | 90.67 | 89.92 | 91.14 | 91.92 | 91.63 | 91.08 | 92.69 | 93.03 (Ko10) |
Slovene | 84.25 | 83.15 | 82.09 | 82.81 | 86.24 | 86.07 | 83.25 | 86.01 | 86.95 (Ma11) |
Spanish | 85.27 | 84.95 | 83.79 | 83.61 | 88.00 | 87.47 | 84.33 | 85.59 | 87.96 (Zh13) |
Swedish | 89.86 | 89.66 | 88.27 | 89.36 | 91.00 | 90.83 | 89.05 | 91.14 | 91.62 (Zh13) |
Turkish | 75.84 | 74.89 | 74.81 | 75.98 | 76.84 | 75.83 | 74.39 | 76.9 | 77.55 (Ko10) |
Average | 87.76 | 87.05 | 86.5 | 86.83 | 89.08 | 88.66 | 87.19 | 88.73 | 89.43 |
We test our dependency model on 14 languages, including the English dataset from CoNLL 2008 shared tasks and all 13 datasets from CoNLL 2006 shared tasks [3, 39]. These datasets include manually annotated dependency trees, POS tags and morphological information. Following standard practices, we encode this information as features.
We compare our model to the MST and Turbo parsers on non-projective dependency parsing. For our parser, we train both a first-order parsing model (as described in Sections 3 and 4) and a third-order model. The third-order parser simply adds the high-order features typically used in the MST and Turbo parsers to the traditional scoring component. The decoding algorithm for third-order parsing is based on [46]. For the Turbo parser, we compare directly against the recently published results in [22]. For the MST parser, we train and test using the most recent version of the code (http://sourceforge.net/projects/mstparser/). We also implemented two baselines, NT-1st (first-order) and NT-3rd (third-order), corresponding to our model without the tensor component.
For the arc feature vector $\phi_{h \to m}$, we use the same set of feature templates as MST v0.5.1. For the head/modifier vectors $\phi_h$ and $\phi_m$, the complete set of feature templates used by our model is shown in Table 1. Finally, we use a set of feature templates similar to that of Turbo v2.1 for third-order parsing.
To add auxiliary word vector representations, we use the publicly available word vectors [5] learned from raw data [13, 20] (https://github.com/wolet/sprml13-word-embeddings). Three languages in our dataset – English, German and Swedish – have corresponding word vectors in this collection. The dimensionality of this representation varies by language: English has 50-dimensional word vectors, while German and Swedish have 25-dimensional word vectors. Each entry of the word vector is added as a feature value to the feature vectors $\phi_h$ and $\phi_m$. For each word in the sentence, we add its own word vector as well as the vectors of its left and right neighbors.
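A sketch of this feature construction follows, with a hypothetical two-dimensional lookup table and made-up index offsets: the dense word-vector coordinates of a word and its neighbors are written into the otherwise sparse feature vector as real-valued features.

```python
import numpy as np

word_vec = {'the': np.array([0.1, -0.3]),      # hypothetical pre-trained embeddings
            'dog': np.array([0.7, 0.2]),
            'barks': np.array([-0.5, 0.4])}
DIM = 2

def word_vector_features(sentence, position, offset_base):
    """Dense word-vector features for phi_h / phi_m: the word itself plus its
    left and right neighbors, stored as {feature index: real value}."""
    feats = {}
    for slot, shift in enumerate((-1, 0, 1)):
        j = position + shift
        if 0 <= j < len(sentence):
            vec = word_vec.get(sentence[j], np.zeros(DIM))
            for k in range(DIM):
                feats[offset_base + slot * DIM + k] = float(vec[k])
    return feats

# indices at and above offset_base are reserved here for the dense features (toy choice)
print(word_vector_features(['the', 'dog', 'barks'], position=1, offset_base=100))
```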
We should note that since the parameter tensor $A$ is represented and learned in low-rank form, we only have to store and maintain the low-rank projections $U\phi_h$, $V\phi_m$, and $W\phi_{h,m}$, rather than explicitly calculating the feature tensor $\phi_h \otimes \phi_m \otimes \phi_{h,m}$. Updating the parameters and decoding a sentence therefore remain efficient, i.e., linear in the number of non-zero values of the feature vectors. In contrast, suppose we took the cross-product of the auxiliary word vector values, POS tags, and lexical items of a word and its context, and added the crossed values as normal features (in $\phi_{h \to m}$). The number of features for each arc would grow at least quadratically, into the thousands, and would be a significant impediment to parsing efficiency.
Following standard practice, we train our full model and the baselines for 10 epochs. As the evaluation measure, we use unlabeled attachment score (UAS) excluding punctuation. In all the reported experiments, the rank of the tensor is set to $r = 50$, and the balancing hyper-parameter $\gamma$ is fixed separately for the first-order and third-order models.
Table 2 shows the performance of our model and the baselines on the 14 CoNLL datasets. Our model outperforms the Turbo parser, the MST parser, and its own variants without the tensor component. The improvements of our low-rank model are consistent across languages: the first-order parser obtains better results on 11 out of 14 languages. Comparing NT-1st and NT-3rd (models without the low-rank component) with our full model (with the low-rank component), we obtain a 0.7% absolute improvement on first-order parsing and a 0.3% improvement on third-order parsing. Our model also achieves the best published UAS on 5 languages.
We next focus on the first-order model and gauge the impact of the tensor component. First, we vary the hyper-parameter $\gamma$, which balances the tensor score against the traditional MST/Turbo score component. Figure 1 shows the average UAS on the CoNLL test datasets after each training epoch. The improvement from adding the low-rank tensor is consistent across the choices of the hyper-parameter $\gamma$. When training with the tensor component alone ($\gamma = 0$), the model converges more slowly: learning the tensor is harder because the scoring function is neither linear nor convex in the parameters $U$, $V$, and $W$. However, the tensor scoring component generalizes better on the test data, resulting in a higher UAS than NT-1st after 8 training epochs.
To assess the ability of our model to incorporate a broader range of features, we add the unsupervised word vectors to our model. As described in the previous section, we do so by appending the values of the word-vector coordinates to $\phi_h$ and $\phi_m$. As Table 3 shows, adding this information increases parsing performance for all three languages; for instance, we obtain more than a 0.5% absolute improvement on Swedish.
Table 3: First-order parsing UAS with and without the auxiliary word vector features.

| | no word vector | with word vector |
|---|---|---|
English | 91.84 | 92.07 |
German | 90.24 | 90.48 |
Swedish | 89.86 | 90.38 |
Table 4: First-order UAS when POS tags are not provided ("-POS"), for our model (optionally adding word vectors, "+wv.") and the NT-1st baseline (without and with POS tags).

| | Ours (-POS) | Ours (+wv.) | NT-1st (-POS) | NT-1st (+POS) |
|---|---|---|---|---|
English | 88.89 | 90.49 | 86.70 | 90.58 |
German | 82.63 | 85.80 | 78.71 | 88.50 |
Swedish | 81.84 | 85.90 | 79.65 | 88.75 |
Since our model learns a compressed representation of the feature vectors, we are interested in its performance when part-of-speech tags are not provided (see Table 4). The rationale is that, given all the other features, the model should induce representations that play a role similar to POS tags. Note that the performance of traditional parsers drops when tags are not provided; for example, the gap is about 10% on German. Our experiments show that the low-rank parser operates effectively in the absence of tags. In fact, on English it nearly reaches the performance of the baseline parser trained with the tags.
We manually analyze the low-dimensional projections to assess whether they capture syntactic abstractions. For this purpose, we train a model with only the tensor component (so that it has to learn an accurate tensor) on the English dataset and obtain the low-dimensional embeddings $U\phi_h$ and $V\phi_m$ for each word. The two $r$-dimensional vectors are concatenated into an "averaged" vector, which we use to calculate the cosine similarity between words. Table 5 shows examples of the five closest neighbors of queried words. While these lists include some noise, the neighbors clearly exhibit similar syntactic behavior; for example, "on" is close to other prepositions. More interestingly, we can consider the impact of syntactic context on the derived projections. The bottom part of Table 5 shows that the neighbors change substantially depending on the syntactic role of the word. For example, the closest words to "increase" are verbs in the context phrase "will increase again", while the closest words become nouns given the different phrase "an increase of".
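The nearest-neighbor lists in Table 5 come from cosine similarity over these concatenated embeddings; with a precomputed embedding matrix, the lookup might look like the sketch below (the vocabulary and vectors are random placeholders).

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ['on', 'with', 'into', 'profit', 'says']
emb = rng.normal(size=(len(vocab), 2 * 50))   # concatenated [U phi_h ; V phi_m] per word (toy)

def nearest(word, k=3):
    """k words whose concatenated head/modifier embeddings are closest in cosine similarity."""
    x = emb[vocab.index(word)]
    sims = emb @ x / (np.linalg.norm(emb, axis=1) * np.linalg.norm(x) + 1e-12)
    return [vocab[i] for i in np.argsort(-sims) if vocab[i] != word][:k]

print(nearest('on'))
```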
Table 5: Five closest neighbors of the queried words, measured by cosine similarity of the learned embeddings (top), and closest neighbors of "increase" and "hit" on their own and in two different syntactic contexts (bottom).

| greatly | profit | says | on | when |
|---|---|---|---|---|
| actively | earnings | adds | with | where |
| openly | franchisees | predicts | into | what |
| significantly | shares | noted | at | why |
| outright | revenue | wrote | during | which |
| substantially | members | contends | over | who |

| increase | will increase again | an increase of |
|---|---|---|
| rise | arguing | gain |
| advance | be | prices |
| contest | charging | payment |
| halt | gone | members |
| Exchequer | making | subsidiary |

| hit | attacks hit the | hardest hit is |
|---|---|---|
| shed | distributes | monopolies |
| rallied | stayed | pills |
| triggered | sang | sophistication |
| appeared | removed | ventures |
| understate | eased | factors |
Table 6: Training time (in hours, 10 epochs, first-order) of NT-1st and our model, together with the number of tokens (#Tok.) and average sentence length (Len.) of each training set.

| | #Tok. | Len. | NT-1st (hours) | Ours (hours) |
|---|---|---|---|---|
Arabic | 42K | 32 | 0.13 | 0.22 |
Chinese | 337K | 6 | 0.37 | 0.65 |
English | 958K | 24 | 1.88 | 2.83 |
Table 6 illustrates the impact of estimating the low-rank tensor parameters on the running time of the algorithm. For comparison, we also show the NT-1st training times on three representative languages. The Arabic dataset has the longest average sentence length, while the Chinese dataset has the shortest average sentence length in CoNLL 2006. Based on these results, estimating a rank-50 tensor together with the MST parameters increases the running time only by a factor of about 1.7.
Accurate scoring of syntactic structures such as head-modifier arcs in dependency parsing typically requires rich, high-dimensional feature representations. We introduce a low-rank factorization method that maps high-dimensional feature vectors into low-dimensional representations. Our method maintains the parameters as a low-rank tensor to obtain low-dimensional representations of words in their syntactic roles, and to leverage modularity in the tensor for easy training with online algorithms. We implement the approach for first-order and third-order dependency parsing. Our parser outperforms the Turbo and MST parsers across 14 languages.
Future work involves extending the tensor component to capture higher-order structures. In particular, we would consider second-order structures such as grandparent-head-modifier arcs by increasing the dimensionality of the tensor; the tensor would accordingly become a four- or five-way array. The online update algorithm remains applicable, since each dimension is optimized in an alternating fashion.
The authors acknowledge the support of the MURI program (W911NF-10-1-0533) and the DARPA BOLT program. This research was developed in collaboration with the Arabic Language Technologies (ALT) group at the Qatar Computing Research Institute (QCRI) within the Iyas project. We thank Volkan Cirik for sharing the unsupervised word vector data. Thanks to Amir Globerson, Andreea Gane, the members of the MIT NLP group, and the ACL reviewers for their suggestions and comments. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.