Accurate scoring of syntactic structures such as head-modifier arcs in dependency parsing typically requires rich, high-dimensional feature representations. A small subset of such features is often selected manually. This is problematic when features lack clear linguistic meaning, as with embeddings, or when the relevant information is blended across features. In this paper, we use tensors to map high-dimensional feature vectors into low-dimensional representations. We explicitly maintain the parameters as a low-rank tensor to obtain low-dimensional representations of words in their syntactic roles, and to leverage modularity in the tensor for easy training with online algorithms. Our parser consistently outperforms the Turbo and MST parsers across 14 different languages. We also obtain the best published UAS results on 5 languages. Our code is available at https://github.com/taolei87/RBGParser.
Finding an expressive representation of input sentences is crucial for accurate parsing. Syntactic relations manifest themselves in a broad range of surface indicators, ranging from morphological to lexical, including positional and part-of-speech (POS) tagging features. Traditionally, parsing research has focused on modeling the direct connection between the features and the predicted syntactic relations such as head-modifier (arc) relations in dependency parsing. Even in the case of first-order parsers, this results in a high-dimensional vector representation of each arc. Discrete features, and their cross products, can be further complemented with auxiliary information about words participating in an arc, such as continuous vector representations of words. The exploding dimensionality of rich feature vectors must then be balanced with the difficulty of effectively learning the associated parameters from limited training data.
A predominant way to counter the high dimensionality of features is to manually design or select a meaningful set of feature templates, which are used to generate different types of features [27, 16, 22]. Direct manual selection may be problematic for two reasons. First, features may lack clear linguistic interpretation as in distributional features or continuous vector embeddings of words. Second, designing a small subset of templates (and features) is challenging when the relevant linguistic information is distributed across the features. For instance, morphological properties are closely tied to part-of-speech tags, which in turn relate to positional features. These features are not redundant. Therefore, we may suffer a performance loss if we select only a small subset of the features. On the other hand, by including all the rich features, we face over-fitting problems.
We depart from this view and leverage high-dimensional feature vectors by mapping them into low dimensional representations. We begin by representing high-dimensional feature vectors as multi-way cross-products of smaller feature vectors that represent words and their syntactic relations (arcs). The associated parameters are viewed as a tensor (multi-way array) of low rank, and optimized for parsing performance. By explicitly representing the tensor in a low-rank form, we have direct control over the effective dimensionality of the set of parameters. We obtain role-dependent low-dimensional representations for words (head, modifier) that are specifically tailored for parsing accuracy, and use standard online algorithms for optimizing the low-rank tensor components.
The overall approach has clear linguistic and computational advantages:
- Our low-dimensional embeddings are tailored to the syntactic context of words (head, modifier). This low-dimensional syntactic abstraction can be thought of as a proxy to manually constructed POS tags.
- By automatically selecting a small number of dimensions useful for parsing, we can leverage a wide array of (correlated) features. Unlike parsers such as MST, we can easily benefit from auxiliary information (e.g., word vectors) appended as features.
We implement the low-rank factorization model in the context of first- and third-order dependency parsing. The model was evaluated on 14 languages, using dependency data from CoNLL 2008 and CoNLL 2006. We compare our results against the MST [27] and Turbo [22] parsers. The low-rank parser achieves average performance of 89.08% across 14 languages, compared to 88.73% for the Turbo parser, and 87.19% for MST. The power of the low-rank model becomes evident in the absence of any part-of-speech tags. For instance, on the English dataset, the low-rank model trained without POS tags achieves 90.49% on first-order parsing, while the baseline gets 86.70% if trained under the same conditions, and 90.58% if trained with 12 core POS tags. Finally, we demonstrate that the model can successfully leverage word vector representations, in contrast to the baselines.
A great deal of parsing research has been dedicated to feature engineering [18, 25, 26]. While in most state-of-the-art parsers, features are selected manually [27, 29, 16, 22, 44, 35], automatic feature selection methods are gaining popularity [23, 1, 31, 2]. Following standard machine learning practices, these algorithms iteratively select a subset of features by optimizing parsing performance on a development set. These feature selection methods are particularly promising in parsing scenarios where the optimal feature set is likely to be a small subset of the original set of candidate features. Our technique, in contrast, is suitable for cases where the relevant information is distributed across a larger set of related features.
A lot of recent work has been done on mapping words into vector spaces [8, 41, 11, 30]. Traditionally, these vector representations have been derived primarily from co-occurrences of words within sentences, ignoring syntactic roles of the co-occurring words. Nevertheless, any such word-level representation can be used to offset inherent sparsity problems associated with full lexicalization [5]. In this sense they perform a role similar to POS tags.
Word-level vector space embeddings have so far had limited impact on parsing performance. From a computational perspective, adding non-sparse vectors directly as features, including their combinations, can significantly increase the number of active features for scoring syntactic structures (e.g., dependency arcs). Because of this issue, Cirik and Şensoy (2013) used word vectors only as unigram features (without combinations) as part of a shift-reduce parser [32]. The improvement in overall parsing performance was marginal. Another application of word vectors is compositional vector grammar [36]. While this method learns to map word combinations into vectors, it builds on existing word-level vector representations. In contrast, we represent words as vectors in a manner that is directly optimized for parsing. This framework enables us to learn new syntactically guided embeddings while also leveraging separately estimated word vectors as starting features, leading to improved parsing performance.
Many machine learning problems can be cast as matrix problems where the matrix represents a set of co-varying parameters. Such problems include, for example, multi-task learning and collaborative filtering. Rather than assuming that each parameter can be set independently of others, it is helpful to assume that the parameters vary in a low-dimensional subspace that has to be estimated together with the parameters. In terms of the parameter matrix, this corresponds to a low-rank assumption. Low-rank constraints are commonly used for improving generalization [19, 37, 38, 12].
A strict low-rank assumption can be restrictive. Indeed, recent approaches to matrix problems decompose the parameter matrix as a sum of low-rank and sparse matrices [40, 47]. The sparse matrix is used to highlight a small number of parameters that should vary independently even if most of them lie on a low-dimensional subspace [42, 4]. We follow this decomposition while extending the parameter matrix into a tensor.
Tensors are multi-way generalizations of matrices and possess an analogous notion of rank. Tensors are increasingly used as tools in spectral estimation [15], including in parsing [6] and other NLP problems [10], where the goal is to avoid local optima in maximum likelihood estimation. In contrast, we expand features for parsing into a multi-way tensor, and operate with an explicit low-rank representation of the associated parameter tensor. The explicit representation sidesteps inherent complexity problems associated with the tensor rank [14]. Our parameters are divided into a sparse set corresponding to manually chosen MST or Turbo parser features and a larger set governed by a low-rank tensor.
We begin by casting first-order dependency parsing as a tensor estimation problem. We first introduce the notation used in the paper, and then give a more formal description of our dependency parsing task.
Let $A \in \mathbb{R}^{n \times n \times d}$ be a 3-dimensional tensor (a 3-way array). We denote each element of the tensor as $A_{i,j,k}$ where $i \in [n]$, $j \in [n]$, $k \in [d]$, and $[n]$ is a shorthand for the set of integers $\{1, 2, \ldots, n\}$. Similarly, we use $M_{i,j}$ and $v_i$ to represent the elements of a matrix $M$ and a vector $v$, respectively.
We define the inner product of two tensors (or matrices) as $\langle A, B \rangle = \mathrm{vec}(A) \cdot \mathrm{vec}(B)$, where $\mathrm{vec}(\cdot)$ concatenates the tensor (or matrix) elements into a column vector. The squared norm of a tensor/matrix is denoted by $\|A\|^2 = \langle A, A \rangle$.
The Kronecker product of three vectors $u \in \mathbb{R}^n$, $v \in \mathbb{R}^n$, and $w \in \mathbb{R}^d$ is denoted by $u \otimes v \otimes w$ and forms a rank-1 tensor such that $(u \otimes v \otimes w)_{i,j,k} = u_i\, v_j\, w_k$.
Note that the vectors $u$, $v$, and $w$ may be column or row vectors. Their orientation is defined based on usage. For example, $u \otimes v = u v^{\top}$ is a rank-1 matrix when $u$ and $v$ are column vectors ($u^{\top} v$ if they are row vectors).
We say that tensor $A$ is in Kruskal form if

$$A = \sum_{i=1}^{r} U(i,:) \otimes V(i,:) \otimes W(i,:) \qquad (1)$$

where $U, V \in \mathbb{R}^{r \times n}$, $W \in \mathbb{R}^{r \times d}$, and $U(i,:)$ is the $i$-th row of matrix $U$. We will directly learn a low-rank tensor $A$ (because $r$ is small) in this form as one of our model parameters.
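As a quick concreteness check (not part of the parser itself), the following Python sketch builds a toy tensor in Kruskal form from random matrices $U$, $V$, $W$ and verifies that its inner product with a rank-1 tensor reduces to a sum over the $r$ components; this identity is what makes the low-rank score introduced below cheap to evaluate. All dimensions and values here are made up.

```python
import numpy as np

n, d, r = 6, 4, 3                                   # toy dimensions (U, V are r x n; W is r x d)
rng = np.random.default_rng(0)
U, V, W = rng.normal(size=(r, n)), rng.normal(size=(r, n)), rng.normal(size=(r, d))

# Kruskal form (Eq. 1): A = sum_i U(i,:) (x) V(i,:) (x) W(i,:)
A = sum(np.einsum('i,j,k->ijk', U[i], V[i], W[i]) for i in range(r))

# a rank-1 tensor u (x) v (x) w built from arbitrary vectors
u, v, w = rng.normal(size=n), rng.normal(size=n), rng.normal(size=d)
rank1 = np.einsum('i,j,k->ijk', u, v, w)

dense_score = np.sum(A * rank1)                     # <A, u (x) v (x) w> via the full tensor
lowrank_score = np.sum((U @ u) * (V @ v) * (W @ w)) # sum_i [Uu]_i [Vv]_i [Ww]_i
assert np.allclose(dense_score, lowrank_score)
```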
Let $x$ be a sentence and $\mathcal{Y}(x)$ the set of possible dependency trees over the words in $x$. We assume that the score $S(x, y)$ of each candidate dependency tree $y \in \mathcal{Y}(x)$ decomposes into a sum of "local" scores for arcs. Specifically:

$$S(x, y) = \sum_{h \to m \,\in\, y} s(h \to m)$$
where $h \to m$ is a head-modifier dependency arc in the tree $y$. Each $y$ is understood as a collection of arcs, where $h$ and $m$ index words in $x$ (note that in the case of high-order parsing, the sum may also include local scores for other syntactic structures, such as a grandparent-head-modifier score; see [22] for a complete list of these structures). For example, $x_h$ is the word corresponding to $h$. We suppress the dependence on $x$ whenever it is clear from context. For example, $s(h \to m)$ can depend on $x$ in complicated ways, as discussed below. The predicted parse is obtained as $\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} S(x, y)$.
A key problem is how we parameterize the arc scores $s(h \to m)$. Following the MST parser [27], we can define rich features characterizing each head-modifier arc, compiled into a sparse binary vector $\phi_{h \to m} \in \mathbb{R}^L$ that depends on the sentence $x$ as well as the chosen arc $h \to m$ (again, we suppress the dependence on $x$). Based on this feature representation, we define the score of each arc as $s_\theta(h \to m) = \theta \cdot \phi_{h \to m}$, where $\theta \in \mathbb{R}^L$ denotes the adjustable parameters to be learned, and $L$ is the number of parameters (and of possible features in $\phi_{h \to m}$).
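Since $\phi_{h \to m}$ is sparse and binary, it is typically stored as a list of active feature indices, so the arc score is just a sum of the corresponding entries of $\theta$, and the tree score is the sum of its arc scores. The sketch below illustrates this with hypothetical feature indices and a random weight vector; it is not the actual MST feature set.

```python
import numpy as np

L = 1000                                         # toy number of possible arc features
theta = np.random.default_rng(1).normal(size=L)  # learned weights (random stand-in)

def arc_score(active_features):
    """s_theta(h -> m) = theta . phi_{h->m}, with phi stored as a list of active indices."""
    return theta[active_features].sum()

def tree_score(tree):
    """S(x, y): sum of local arc scores; each arc carries its own active features."""
    return sum(arc_score(feats) for (_head, _mod, feats) in tree)

# a tiny hypothetical tree: (head index, modifier index, active feature indices)
toy_tree = [(0, 1, [3, 57, 204]), (1, 2, [3, 999])]
print(tree_score(toy_tree))
```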
Table 1: Feature templates for the head and modifier vectors $\phi_h$ and $\phi_m$.

| Unigram features | | |
|---|---|---|
| form | form-p | form-n |
| lemma | lemma-p | lemma-n |
| pos | pos-p | pos-n |
| morph | bias | |
| Bigram features | | |
| pos-p, pos | | |
| pos, pos-n | | |
| pos, lemma | | |
| morph, lemma | | |
| Trigram features | | |
| pos-p, pos, pos-n | | |
We can alternatively specify arc features in terms of rank-1 tensors by taking the Kronecker product of simpler feature vectors associated with the head (vector $\phi_h \in \mathbb{R}^n$) and the modifier (vector $\phi_m \in \mathbb{R}^n$), as well as the arc itself (vector $\phi_{h,m} \in \mathbb{R}^d$). Here $\phi_{h,m}$ is much lower-dimensional than the MST arc feature vector $\phi_{h \to m}$ discussed earlier. For example, $\phi_{h,m}$ may be composed only of indicators for binned arc lengths (in our current version, $\phi_{h,m}$ contains only the binned arc length; other possible features include, for example, the label of the arc $h \to m$, the POS tags between the head and the modifier, and boolean flags indicating the occurrence of in-between punctuation or conjunctions). $\phi_h$ and $\phi_m$, on the other hand, are built from the features shown in Table 1. By taking the cross-product of all these component feature vectors, we obtain the full feature representation of the arc $h \to m$ as a rank-1 tensor

$$\phi_h \otimes \phi_m \otimes \phi_{h,m}$$
Note that the elements of this rank-1 tensor include feature combinations that are not part of the feature crossings in $\phi_{h \to m}$. In this sense, the rank-1 tensor represents a substantial feature expansion. The arc score associated with the tensor representation is defined analogously as

$$s_A(h \to m) = \langle A,\ \phi_h \otimes \phi_m \otimes \phi_{h,m} \rangle$$
where the adjustable parameters $A \in \mathbb{R}^{n \times n \times d}$ also form a tensor. Given the typical dimensions of the component feature vectors $\phi_h$, $\phi_m$, and $\phi_{h,m}$, it is not even possible to store all the parameters in $A$ explicitly. Indeed, in the full English training set of CoNLL-2008, the tensor $A$ would contain orders of magnitude more entries than the number of features in the MST feature vector. To counter this feature explosion, we restrict the parameters $A$ to have low rank.
We can represent a rank-$r$ tensor $A$ explicitly in terms of parameter matrices $U$, $V$, and $W$ as shown in Eq. 1. As a result, the arc score for the tensor reduces to evaluating $U\phi_h$, $V\phi_m$, and $W\phi_{h,m}$, which are all $r$-dimensional vectors and can be computed efficiently from the sparse vectors $\phi_h$, $\phi_m$, and $\phi_{h,m}$. The resulting arc score is then
$$s_A(h \to m) = \sum_{i=1}^{r} [U\phi_h]_i\, [V\phi_m]_i\, [W\phi_{h,m}]_i \qquad (2)$$
By learning parameters $U$, $V$, and $W$ that function well in dependency parsing, we also learn context-dependent embeddings for words and arcs. Specifically, $U\phi_h$ (for a given sentence $x$, suppressed) is an $r$-dimensional vector representation of the word corresponding to $h$ in its role as a head word. Similarly, $V\phi_m$ provides an analogous representation of a modifier $m$. Finally, $W\phi_{h,m}$ is a vector embedding of the supplemental arc-dependent information. The resulting embeddings are therefore tied to the syntactic roles of the words (and arcs), and learned in order to perform well in parsing.
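The sketch below (toy dimensions, random parameters, hypothetical feature indices) shows how the score in Eq. 2 is evaluated in practice: each sparse component vector is projected once through $U$, $V$, or $W$, which also yields the $r$-dimensional role-specific embeddings described above.

```python
import numpy as np

n, d, r = 8, 3, 4                               # toy feature and rank dimensions
rng = np.random.default_rng(2)
U, V, W = rng.normal(size=(r, n)), rng.normal(size=(r, n)), rng.normal(size=(r, d))

def project(M, active_ids):
    """M @ phi for a sparse binary phi given by its list of active indices."""
    return M[:, active_ids].sum(axis=1)

# hypothetical active features for one arc h -> m
head_feats, mod_feats, arc_feats = [0, 5], [2, 7], [1]

head_emb = project(U, head_feats)   # U phi_h : embedding of the word in its head role
mod_emb = project(V, mod_feats)     # V phi_m : embedding of the word in its modifier role
arc_emb = project(W, arc_feats)     # W phi_{h,m}: embedding of the arc information

s_tensor = float(np.sum(head_emb * mod_emb * arc_emb))   # Eq. 2
print(s_tensor)
```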
We expect a dependency parsing model to benefit from several aspects of the low-rank tensor scoring. For example, we can easily incorporate additional useful features in the feature vectors $\phi_h$, $\phi_m$, and $\phi_{h,m}$, since the low-rank assumption (for small enough $r$) effectively counters the otherwise uncontrolled feature expansion. Moreover, by controlling the amount of information we can extract from each of the component feature vectors (via the rank $r$), the statistical estimation problem does not scale dramatically with the dimensions of $\phi_h$, $\phi_m$, and $\phi_{h,m}$. In particular, the low-rank constraint can help generalize to unseen arcs. Consider a product feature $[\phi_h]_i [\phi_m]_j [\phi_{h,m}]_k$ which is non-zero only for an arc of a particular binned length between a particular head and modifier word. If such an arc has not been seen in the available training data, it does not contribute to the traditional arc score $s_\theta(h \to m)$. In contrast, with the low-rank constraint, the arc score in Eq. 2 would typically be non-zero.
Our parsing model aims to combine the strengths of the traditional MST/Turbo parser features with the new low-rank tensor features. In this way, the model can capture a wide range of information, including auxiliary features, without an uncontrolled feature explosion, while retaining full access to the manually engineered features that have proven useful. Specifically, we define the arc score as the combination
$$s_\gamma(h \to m) = \gamma\, s_\theta(h \to m) + (1 - \gamma)\, s_A(h \to m) \qquad (3)$$
where $\theta \in \mathbb{R}^L$, $U, V \in \mathbb{R}^{r \times n}$, and $W \in \mathbb{R}^{r \times d}$ are the model parameters to be learned. The rank $r$ and the weight $\gamma \in [0, 1]$ (balancing the two scores) are hyper-parameters of our model.
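The blend itself is a single line; the small helper below assumes the two component scores have already been computed as in the previous sketches (the argument values are placeholders).

```python
def combined_arc_score(s_theta, s_tensor, gamma):
    """Eq. 3: gamma weights the traditional MST-style score, (1 - gamma) the tensor score."""
    return gamma * s_theta + (1.0 - gamma) * s_tensor

# e.g. gamma = 0 scores arcs with the low-rank tensor component alone
print(combined_arc_score(s_theta=1.2, s_tensor=0.4, gamma=0.3))
```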
The training set consists of pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where each pair consists of a sentence $x_i$ and the corresponding gold (target) parse $y_i$. The goal is to learn values of the parameters $\theta$, $U$, $V$, and $W$ that optimize the combined scoring function $S_\gamma(x, y) = \sum_{h \to m \in y} s_\gamma(h \to m)$, with $s_\gamma$ defined in Eq. 3, for parsing performance. We adopt a maximum soft-margin framework for this learning problem. Specifically, we find parameters $\theta$, $U$, $V$, $W$, and slack variables $\{\xi_i\}$ that minimize
$$\sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad S_\gamma(x_i, y_i) \ge S_\gamma(x_i, \tilde{y}) + \|\tilde{y} - y_i\|_1 - \xi_i, \quad \xi_i \ge 0, \quad \forall\, \tilde{y} \in \mathcal{Y}(x_i),\ \forall i \qquad (4)$$
where $\|\tilde{y} - y_i\|_1$ is the number of mismatched arcs between the two trees, and $\xi_i$ is a non-negative slack variable. The constraints serve to separate the gold tree from the other alternatives in $\mathcal{Y}(x_i)$ with a margin that increases with this distance.
The objective as stated is not jointly convex with respect to $U$, $V$, and $W$ due to our explicit representation of the low-rank tensor. However, if we fix any two of these parameter sets, for example $V$ and $W$, then the combined score $S_\gamma(x, y)$ is a linear function of both $\theta$ and $U$. As a result, the objective is jointly convex with respect to $\theta$ and $U$ and could be optimized using standard tools. To accelerate learning, however, we adopt an online learning setup. Specifically, we use the passive-aggressive learning algorithm [9] tailored to our setting, updating the parameter pairs $(\theta, U)$, $(\theta, V)$, and $(\theta, W)$ in an alternating manner. This method is described below.
In an online learning setup, we update the parameters successively based on each sentence. In order to apply the passive-aggressive algorithm, we fix two of $U$, $V$, and $W$ (say, for example, $V$ and $W$) in an alternating manner, and apply a closed-form update to the remaining parameters (here $U$ and $\theta$). This is possible because the objective function with respect to $(\theta, U)$ has the same form as in the original passive-aggressive algorithm. To illustrate this, consider a training sentence $x_i$. The update first finds the best competing tree,
$$\tilde{y}_i = \arg\max_{y \in \mathcal{Y}(x_i)} \; S_\gamma(x_i, y) + \|y - y_i\|_1 \qquad (5)$$
which is the tree that most violates the constraints in Eq. 4 (i.e., maximizes the loss $\xi_i$). We then obtain the parameter increments $\Delta\theta$ and $\Delta U$ by solving

$$\min_{\Delta\theta,\, \Delta U,\, \xi \ge 0} \;\; \frac{1}{2}\|\Delta\theta\|^2 + \frac{1}{2}\|\Delta U\|^2 + C\,\xi \qquad \text{s.t.} \quad S_\gamma(x_i, y_i) - S_\gamma(x_i, \tilde{y}_i) \ge \|\tilde{y}_i - y_i\|_1 - \xi$$

where the scores on the left-hand side are computed with the updated parameters $\theta + \Delta\theta$ and $U + \Delta U$.
In this way, the optimization problem attempts to keep the parameter change as small as possible, while forcing it to achieve mostly zero loss on this single training instance. This problem has a closed-form solution

$$\Delta\theta = \min\!\left\{C, \; \frac{\mathrm{loss}}{\|d\theta\|^2 + \|dU\|^2}\right\} d\theta, \qquad \Delta U = \min\!\left\{C, \; \frac{\mathrm{loss}}{\|d\theta\|^2 + \|dU\|^2}\right\} dU$$

where

$$\begin{aligned}
\mathrm{loss} &= S_\gamma(x_i, \tilde{y}_i) + \|\tilde{y}_i - y_i\|_1 - S_\gamma(x_i, y_i) \\
d\theta &= \gamma \Big( \sum_{h \to m \,\in\, y_i} \phi_{h \to m} \;-\; \sum_{h \to m \,\in\, \tilde{y}_i} \phi_{h \to m} \Big) \\
dU &= (1-\gamma) \Big( \sum_{h \to m \,\in\, y_i} \big( V\phi_m \odot W\phi_{h,m} \big)\, \phi_h^{\top} \;-\; \sum_{h \to m \,\in\, \tilde{y}_i} \big( V\phi_m \odot W\phi_{h,m} \big)\, \phi_h^{\top} \Big)
\end{aligned}$$
Here $\odot$ denotes the Hadamard (element-wise) product. The magnitude of the change of $U$ and $\theta$ is controlled by the parameter $C$; by varying $C$, we can determine an appropriate step size for the online updates. The updates also illustrate how $\gamma$ balances the effect of the MST component of the score relative to the low-rank tensor score. When $\gamma = 0$, the arc scores are based entirely on the low-rank tensor and $\theta$ is never updated ($d\theta = 0$). Note that $\phi_h$, $\phi_m$, $\phi_{h,m}$, and $\phi_{h \to m}$ are typically very sparse for each word or arc. Therefore $d\theta$ and $dU$ are also sparse and can be computed efficiently.
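The following sketch implements one such alternating passive-aggressive step for the pair $(\theta, U)$ with $V$ and $W$ held fixed, following the update reconstructed above. The arc representation, the caller-supplied loss, and all dimensions are placeholders; this is a minimal illustration rather than the authors' implementation.

```python
import numpy as np

def pa_step(theta, U, V, W, gold_arcs, pred_arcs, loss, gamma, C):
    """One passive-aggressive update of (theta, U), with V and W fixed.

    Each arc is a dict holding its sparse MST feature indices 'phi_arc' (assumed
    distinct) and dense toy vectors 'phi_h', 'phi_m', 'phi_hm'.  `loss` is the
    margin violation of the best competing tree, computed by the caller.
    """
    d_theta, d_U = np.zeros_like(theta), np.zeros_like(U)
    for arcs, sign in ((gold_arcs, +1.0), (pred_arcs, -1.0)):
        for a in arcs:
            d_theta[a['phi_arc']] += sign * gamma
            # gradient of Eq. 2 w.r.t. U: (V phi_m elementwise-times W phi_hm) phi_h^T
            d_U += sign * (1.0 - gamma) * np.outer(
                (V @ a['phi_m']) * (W @ a['phi_hm']), a['phi_h'])
    denom = np.sum(d_theta ** 2) + np.sum(d_U ** 2)
    step = 0.0 if loss <= 0.0 or denom == 0.0 else min(C, loss / denom)
    return theta + step * d_theta, U + step * d_U

# toy usage with random parameters and a single gold arc
rng = np.random.default_rng(3)
n, d, r, L = 5, 2, 3, 10
theta, U = rng.normal(size=L), rng.normal(size=(r, n))
V, W = rng.normal(size=(r, n)), rng.normal(size=(r, d))
arc = {'phi_arc': [1, 4], 'phi_h': rng.normal(size=n),
       'phi_m': rng.normal(size=n), 'phi_hm': rng.normal(size=d)}
theta, U = pa_step(theta, U, V, W, [arc], [], loss=1.0, gamma=0.3, C=0.01)
```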
The alternating online algorithm relies on how we initialize $U$, $V$, and $W$, since each update is carried out in the context of the other two. A random initialization of these parameters is unlikely to work well, both because of the dimensions involved and because of the nature of the alternating updates. We instead use a reasonable deterministic "guess" as the initialization.
We begin by training our model without any low-rank parameters and obtain the parameters $\theta$. The majority of features in this MST component can be expressed as elements of the feature tensor, i.e., as products $[\phi_h]_i [\phi_m]_j [\phi_{h,m}]_k$. We can therefore create a tensor representation $B$ of $\theta$ such that $B_{i,j,k}$ equals the corresponding parameter value in $\theta$. We use a low-rank approximation of $B$ as the initialization. Specifically, we unfold the tensor $B$ into a matrix of dimensions $n$ and $n \cdot d$, where $n$ is the dimension of $\phi_h$ and $\phi_m$ and $d$ is the dimension of $\phi_{h,m}$. For instance, a rank-1 tensor $u \otimes v \otimes w$ can be unfolded as the matrix $u\,(v \otimes w)^{\top}$. We compute the top-$r$ SVD of the resulting unfolded matrix, $P \Sigma Q^{\top}$. $U$ is initialized from the top-$r$ left singular vectors, scaled by the corresponding singular values. Each right singular vector $Q(i,:)$ can in turn be viewed as a matrix in $\mathbb{R}^{n \times d}$; the leading left and right singular vectors of this matrix are assigned to $V(i,:)$ and $W(i,:)$, respectively. In our implementation, we run one epoch of our model without the low-rank parameters and use the resulting $\theta$ to initialize the tensor components in this way.
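A sketch of this initialization with toy dimensions is given below. Exactly how the singular values are distributed across $U$, $V$, and $W$ is a design choice of this sketch, not a detail taken from the text.

```python
import numpy as np

n, d, r = 6, 4, 3
rng = np.random.default_rng(4)
B = rng.normal(size=(n, n, d))       # stands in for the tensor filled from the MST weights

# unfold along the head mode: rows indexed by head features, columns by (modifier, arc)
B_unf = B.reshape(n, n * d)
P, S, Qt = np.linalg.svd(B_unf, full_matrices=False)

U = S[:r, None] * P[:, :r].T         # top-r left singular vectors, scaled by singular values
V, W = np.zeros((r, n)), np.zeros((r, d))
for i in range(r):
    Qi = Qt[i].reshape(n, d)         # each right singular vector, viewed as an n x d matrix
    p, s, qt = np.linalg.svd(Qi, full_matrices=False)
    V[i], W[i] = p[:, 0], s[0] * qt[0]   # its leading left/right singular vectors

# sanity check: the Kruskal-form reconstruction approximates the original tensor
A = sum(np.einsum('i,j,k->ijk', U[i], V[i], W[i]) for i in range(r))
print(np.linalg.norm(A - B) / np.linalg.norm(B))
```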
The passive-aggressive algorithm regularizes the increments (e.g., $\Delta U$ and $\Delta\theta$) during each update, but does not include any overall regularization. In other words, continuing to update the model may lead to large parameter values and over-fitting. To counter this effect, we use parameter averaging, as in the MST and Turbo parsers: the final parameters are those averaged across all iterations (cf. [7]). For simplicity, we average $\theta$, $U$, $V$, and $W$ separately, which works well empirically.
Table 2: Unlabeled attachment scores (UAS, excluding punctuation) of our model and the baselines on the 14 CoNLL datasets. The first four columns are first-order parsers; the remaining columns are high-order parsers.

| | Ours | NT-1st | MST | Turbo | Ours-3rd | NT-3rd | MST-2nd | Turbo-3rd | Best Published |
|---|---|---|---|---|---|---|---|---|---|
Arabic | 79.60 | 78.71 | 78.3 | 77.23 | 79.95 | 79.53 | 78.75 | 79.64 | 81.12 (Ma11) |
Bulgarian | 92.30 | 91.14 | 90.98 | 91.76 | 93.50 | 92.79 | 91.56 | 93.1 | 94.02 (Zh13) |
Chinese | 91.43 | 90.85 | 90.40 | 88.49 | 92.68 | 92.39 | 91.77 | 89.98 | 91.89 (Ma10) |
Czech | 87.90 | 86.62 | 86.18 | 87.66 | 90.50 | 89.43 | 87.3 | 90.32 | 90.32 (Ma13) |
Danish | 90.64 | 89.80 | 89.84 | 89.42 | 91.39 | 90.82 | 90.5 | 91.48 | 92.00 (Zh13) |
Dutch | 84.81 | 83.77 | 82.89 | 83.61 | 86.41 | 86.08 | 84.11 | 86.19 | 86.19 (Ma13) |
English | 91.84 | 91.40 | 90.59 | 91.21 | 93.02 | 92.82 | 91.54 | 93.22 | 93.22 (Ma13) |
German | 90.24 | 89.70 | 89.54 | 90.52 | 91.97 | 92.26 | 90.14 | 92.41 | 92.41 (Ma13) |
Japanese | 93.74 | 93.36 | 93.38 | 92.78 | 93.71 | 93.23 | 92.92 | 93.52 | 93.72 (Ma11) |
Portuguese | 90.94 | 90.67 | 89.92 | 91.14 | 91.92 | 91.63 | 91.08 | 92.69 | 93.03 (Ko10) |
Slovene | 84.25 | 83.15 | 82.09 | 82.81 | 86.24 | 86.07 | 83.25 | 86.01 | 86.95 (Ma11) |
Spanish | 85.27 | 84.95 | 83.79 | 83.61 | 88.00 | 87.47 | 84.33 | 85.59 | 87.96 (Zh13) |
Swedish | 89.86 | 89.66 | 88.27 | 89.36 | 91.00 | 90.83 | 89.05 | 91.14 | 91.62 (Zh13) |
Turkish | 75.84 | 74.89 | 74.81 | 75.98 | 76.84 | 75.83 | 74.39 | 76.9 | 77.55 (Ko10) |
Average | 87.76 | 87.05 | 86.5 | 86.83 | 89.08 | 88.66 | 87.19 | 88.73 | 89.43 |
We test our dependency model on 14 languages, including the English dataset from CoNLL 2008 shared tasks and all 13 datasets from CoNLL 2006 shared tasks [3, 39]. These datasets include manually annotated dependency trees, POS tags and morphological information. Following standard practices, we encode this information as features.
We compare our model to the MST and Turbo parsers on non-projective dependency parsing. For our parser, we train both a first-order parsing model (as described in Sections 3 and 4) and a third-order model. The third-order parser simply adds the high-order features typically used in the MST and Turbo parsers to the traditional scoring component. The decoding algorithm for third-order parsing is based on [46]. For the Turbo parser, we compare directly against the recently published results in [22]. For the MST parser, we train and test using the most recent version of the code (http://sourceforge.net/projects/mstparser/). We also implemented two baselines, NT-1st (first-order) and NT-3rd (third-order), corresponding to our model without the tensor component.
For the arc feature vector $\phi_{h \to m}$, we use the same set of feature templates as MST v0.5.1. For the head/modifier vectors $\phi_h$ and $\phi_m$, the complete set of feature templates used by our model is shown in Table 1. Finally, we use a set of feature templates similar to that of Turbo v2.1 for third-order parsing.
To add auxiliary word vector representations, we use the publicly available word vectors [5] learned from raw data [13, 20] (https://github.com/wolet/sprml13-word-embeddings). Three languages in our dataset – English, German and Swedish – have corresponding word vectors in this collection. The dimensionality of this representation varies by language: English has 50-dimensional word vectors, while German and Swedish have 25-dimensional word vectors. Each entry of the word vector is added as a feature value to the feature vectors $\phi_h$ and $\phi_m$. For each word in the sentence, we add its own word vector as well as the vectors of its left and right neighbors.
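A sketch of this feature construction follows, with a hypothetical two-dimensional lookup table and made-up index offsets: the dense word-vector coordinates of a word and its neighbors are written into the otherwise sparse feature vector as real-valued features.

```python
import numpy as np

word_vec = {'the': np.array([0.1, -0.3]),      # hypothetical pre-trained embeddings
            'dog': np.array([0.7, 0.2]),
            'barks': np.array([-0.5, 0.4])}
DIM = 2

def word_vector_features(sentence, position, offset_base):
    """Dense word-vector features for phi_h / phi_m: the word itself plus its
    left and right neighbors, stored as {feature index: real value}."""
    feats = {}
    for slot, shift in enumerate((-1, 0, 1)):
        j = position + shift
        if 0 <= j < len(sentence):
            vec = word_vec.get(sentence[j], np.zeros(DIM))
            for k in range(DIM):
                feats[offset_base + slot * DIM + k] = float(vec[k])
    return feats

# indices at and above offset_base are reserved here for the dense features (toy choice)
print(word_vector_features(['the', 'dog', 'barks'], position=1, offset_base=100))
```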
We should note that since the parameter tensor $A$ is represented and learned in low-rank form, we only have to store and maintain the low-rank projections $U\phi_h$, $V\phi_m$, and $W\phi_{h,m}$, rather than explicitly calculating the feature tensor $\phi_h \otimes \phi_m \otimes \phi_{h,m}$. Updating the parameters and decoding a sentence therefore remain efficient, i.e., linear in the number of non-zero values of the feature vectors. In contrast, suppose we took the cross-product of the auxiliary word vector values, POS tags, and lexical items of a word and its context, and added the crossed values as normal features (in $\phi_{h \to m}$). The number of features for each arc would grow at least quadratically, into the thousands, and would be a significant impediment to parsing efficiency.
Following standard practice, we train our full model and the baselines for 10 epochs. As the evaluation measure, we use unlabeled attachment score (UAS) excluding punctuation. In all the reported experiments, the rank of the tensor is set to $r = 50$, and the balancing hyper-parameter $\gamma$ is fixed separately for the first-order and third-order models.
Table 2 shows the performance of our model and the baselines on the 14 CoNLL datasets. Our model outperforms the Turbo parser, the MST parser, and its own variants without the tensor component. The improvements of our low-rank model are consistent across languages: the first-order parser obtains better results on 11 out of 14 languages. Comparing NT-1st and NT-3rd (models without the low-rank component) with our full model (with the low-rank component), we obtain a 0.7% absolute improvement on first-order parsing and a 0.3% improvement on third-order parsing. Our model also achieves the best published UAS on 5 languages.
We next focus on the first-order model and gauge the impact of the tensor component. First, we vary the hyper-parameter $\gamma$, which balances the tensor score against the traditional MST/Turbo score component. Figure 1 shows the average UAS on the CoNLL test datasets after each training epoch. The improvement from adding the low-rank tensor is consistent across the choices of the hyper-parameter $\gamma$. When training with the tensor component alone ($\gamma = 0$), the model converges more slowly: learning the tensor is harder because the scoring function is neither linear nor convex in the parameters $U$, $V$, and $W$. However, the tensor scoring component generalizes better on the test data, resulting in a higher UAS than NT-1st after 8 training epochs.
To assess the ability of our model to incorporate a broader range of features, we add the unsupervised word vectors to our model. As described in the previous section, we do so by appending the values of the word-vector coordinates to $\phi_h$ and $\phi_m$. As Table 3 shows, adding this information increases parsing performance for all three languages; for instance, we obtain more than a 0.5% absolute improvement on Swedish.
Table 3: First-order parsing UAS with and without the auxiliary word vector features.

| | no word vector | with word vector |
|---|---|---|
English | 91.84 | 92.07 |
German | 90.24 | 90.48 |
Swedish | 89.86 | 90.38 |
Table 4: First-order UAS when POS tags are not provided ("-POS"), for our model (optionally adding word vectors, "+wv.") and the NT-1st baseline (without and with POS tags).

| | Ours (-POS) | Ours (+wv.) | NT-1st (-POS) | NT-1st (+POS) |
|---|---|---|---|---|
English | 88.89 | 90.49 | 86.70 | 90.58 |
German | 82.63 | 85.80 | 78.71 | 88.50 |
Swedish | 81.84 | 85.90 | 79.65 | 88.75 |
Since our model learns a compressed representation of the feature vectors, we are interested in its performance when part-of-speech tags are not provided (see Table 4). The rationale is that, given all the other features, the model should induce representations that play a role similar to POS tags. Note that the performance of traditional parsers drops when tags are not provided; for example, the gap is about 10% on German. Our experiments show that the low-rank parser operates effectively in the absence of tags. In fact, on English it nearly reaches the performance of the baseline parser trained with the tags.
We manually analyze the low-dimensional projections to assess whether they capture syntactic abstractions. For this purpose, we train a model with only the tensor component (so that it has to learn an accurate tensor) on the English dataset and obtain the low-dimensional embeddings $U\phi_h$ and $V\phi_m$ for each word. The two $r$-dimensional vectors are concatenated into an "averaged" vector, which we use to calculate the cosine similarity between words. Table 5 shows examples of the five closest neighbors of queried words. While these lists include some noise, the neighbors clearly exhibit similar syntactic behavior; for example, "on" is close to other prepositions. More interestingly, we can consider the impact of syntactic context on the derived projections. The bottom part of Table 5 shows that the neighbors change substantially depending on the syntactic role of the word. For example, the closest words to "increase" are verbs in the context phrase "will increase again", while the closest words become nouns given the different phrase "an increase of".
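The nearest-neighbor lists in Table 5 come from cosine similarity over these concatenated embeddings; with a precomputed embedding matrix, the lookup might look like the sketch below (the vocabulary and vectors are random placeholders).

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ['on', 'with', 'into', 'profit', 'says']
emb = rng.normal(size=(len(vocab), 2 * 50))   # concatenated [U phi_h ; V phi_m] per word (toy)

def nearest(word, k=3):
    """k words whose concatenated head/modifier embeddings are closest in cosine similarity."""
    x = emb[vocab.index(word)]
    sims = emb @ x / (np.linalg.norm(emb, axis=1) * np.linalg.norm(x) + 1e-12)
    return [vocab[i] for i in np.argsort(-sims) if vocab[i] != word][:k]

print(nearest('on'))
```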
Table 5: Five closest neighbors of the queried words, measured by cosine similarity of the learned embeddings (top), and closest neighbors of "increase" and "hit" on their own and in two different syntactic contexts (bottom).

| greatly | profit | says | on | when |
|---|---|---|---|---|
| actively | earnings | adds | with | where |
| openly | franchisees | predicts | into | what |
| significantly | shares | noted | at | why |
| outright | revenue | wrote | during | which |
| substantially | members | contends | over | who |

| increase | will increase again | an increase of |
|---|---|---|
| rise | arguing | gain |
| advance | be | prices |
| contest | charging | payment |
| halt | gone | members |
| Exchequer | making | subsidiary |

| hit | attacks hit the | hardest hit is |
|---|---|---|
| shed | distributes | monopolies |
| rallied | stayed | pills |
| triggered | sang | sophistication |
| appeared | removed | ventures |
| understate | eased | factors |
Table 6: Training time (in hours, 10 epochs, first-order) of NT-1st and our model, together with the number of tokens (#Tok.) and average sentence length (Len.) of each training set.

| | #Tok. | Len. | NT-1st (hours) | Ours (hours) |
|---|---|---|---|---|
Arabic | 42K | 32 | 0.13 | 0.22 |
Chinese | 337K | 6 | 0.37 | 0.65 |
English | 958K | 24 | 1.88 | 2.83 |
Table 6 illustrates the impact of estimating the low-rank tensor parameters on the running time of the algorithm. For comparison, we also show the NT-1st training times on three representative languages. The Arabic dataset has the longest average sentence length, while the Chinese dataset has the shortest average sentence length in CoNLL 2006. Based on these results, estimating a rank-50 tensor together with the MST parameters increases the running time only by a factor of about 1.7.
Accurate scoring of syntactic structures such as head-modifier arcs in dependency parsing typically requires rich, high-dimensional feature representations. We introduce a low-rank factorization method that maps high-dimensional feature vectors into low-dimensional representations. Our method maintains the parameters as a low-rank tensor to obtain low-dimensional representations of words in their syntactic roles, and to leverage modularity in the tensor for easy training with online algorithms. We implement the approach for first-order and third-order dependency parsing. Our parser outperforms the Turbo and MST parsers across 14 languages.
Future work involves extending the tensor component to capture higher-order structures. In particular, we would consider second-order structures such as grandparent-head-modifier arcs by increasing the dimensionality of the tensor; the tensor would accordingly become a four- or five-way array. The online update algorithm remains applicable, since each dimension is optimized in an alternating fashion.
The authors acknowledge the support of the MURI program (W911NF-10-1-0533) and the DARPA BOLT program. This research was developed in collaboration with the Arabic Language Technologies (ALT) group at the Qatar Computing Research Institute (QCRI) within the Iyas project. We thank Volkan Cirik for sharing the unsupervised word vector data. Thanks to Amir Globerson, Andreea Gane, the members of the MIT NLP group, and the ACL reviewers for their suggestions and comments. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.