Unsupervised Dependency Parsing with Transferring
Distribution via Parallel Guidance and Entropy Regularization

Xuezhe Ma
Department of Linguistics
University of Washington
Seattle, WA 98195, USA
xzma@uw.edu
&Fei Xia
Department of Linguistics
University of Washington
Seattle, WA 98195, USA
fxia@uw.edu
Abstract

We present a novel approach for inducing unsupervised dependency parsers for languages that have no labeled training data, but have translated text in a resource-rich language. We train probabilistic parsing models for resource-poor languages by transferring cross-lingual knowledge from resource-rich language with entropy regularization. Our method can be used as a purely monolingual dependency parser, requiring no human translations for the test data, thus making it applicable to a wide range of resource-poor languages. We perform experiments on three Data sets — Version 1.0 and version 2.0 of Google Universal Dependency Treebanks and Treebanks from CoNLL shared-tasks, across ten languages. We obtain state-of-the art performance of all the three data sets when compared with previously studied unsupervised and projected parsing systems.

1 Introduction

In recent years, dependency parsing has gained universal interest due to its usefulness in a wide range of applications such as synonym generation [43], relation extraction [37] and machine translation [21, 51]. Several supervised dependency parsing algorithms [39, 30, 32, 33, 8, 24, 27, 52] have been proposed and achieved high parsing accuracies on several treebanks, due in large part to the availability of dependency treebanks in a number of languages [31]. However, the manually annotated treebanks that these parsers rely on are highly expensive to create, in particular when we want to build treebanks for resource-poor languages. This led to a vast amount of research on unsupervised grammar induction  [9, 22, 47, 12, 48, 4, 29, 49], which appears to be a natural solution to this problem, as unsupervised methods require only unannotated text for training parsers. Unfortunately, the unsupervised grammar induction systems’ parsing accuracies often significantly fall behind those of supervised systems [34]. Furthermore, from a practical standpoint, it is rarely the case that we are completely devoid of resources for most languages.

In this paper, we consider a practically motivated scenario, in which we want to build statistical parsers for resource-poor target languages, using existing resources from a resource-rich source language (like English).11For the sake of simplicity, we refer to the resource-poor language as the “target language”, and resource-rich language as the “source language”. In addition, in this study we use English as the source resource-rich language, but our methodology can be applied to any resource-rich languages. We assume that there are absolutely no labeled training data for the target language, but we have access to parallel data with a resource-rich language and a sufficient amount of labeled training data to build an accurate parser for the resource-rich language. This scenario appears similar to the setting in bilingual text parsing. However, most bilingual text parsing approaches require bilingual treebanks — treebanks that have manually annotated tree structures on both sides of source and target languages [45, 7], or have tree structures on the source side and translated sentences in the target languages [18, 10]. Obviously, bilingual treebanks are much more difficult to acquire than the resources required in our scenario, since the labeled training data and the parallel text in our case are completely separated. What is more important is that most studies on bilingual text parsing assumed that the parser is applied only on bilingual text. But our goal is to develop a parser that can be used in completely monolingual setting for each target language of interest.

This scenario is applicable to a large set of languages and many research studies [19] have been made on it. Ganchev et al. [14] presented a parser projection approach via parallel text using the posterior regularization framework [16]. McDonald et al. [34] proposed two parser transfer approaches between two different languages — one is directly transferred parser from delexicalized parsers, and the other parser is transferred using constraint driven learning algorithm where constraints are drawn from parallel corpora. In that work, they demonstrate that even the directly transferred delexicalized parser produces significantly higher accuracies than unsupervised parsers. Cohen et al. [11] proposed an approach for unsupervised dependency parsing with non-parallel multilingual guidance from one or more helper languages, in which parallel data is not used.

In this work, we propose a learning framework for transferring dependency grammars from a resource-rich language to resource-poor languages via parallel text. We train probabilistic parsing models for resource-poor languages by maximizing a combination of likelihood on parallel data and confidence on unlabeled data. Our work is based on the learning framework used in Smith and Eisner [44], which is originally designed for parser bootstrapping. We extend this learning framework so that it can be used to transfer cross-lingual knowledge between different languages.

Throughout this paper, English is used as the source language and we evaluate our approach on ten target languages — Danish (da), Dutch (nl), French (fr), German (de), Greek (el), Italian (it), Korean (ko), Portuguese (pt), Spanish (es) and Swedish (sv). Our approach achieves significant improvement over previous state-of-the-art unsupervised and projected parsing systems across all the ten languages, and considerably bridges the gap to fully supervised dependency parsing performance.

2 Our Approach

Dependency trees represent syntactic relationships through labeled directed edges between heads and their dependents. For example, Figure 1 shows a dependency tree for the sentence, Economic news had little effect on financial markets, with the sentence’s root-symbol as its root. The focus of this work is on building dependency parsers for target languages, assuming that an accurate English dependency parser and some parallel text between the two languages are available. Central to our approach is a maximizing likelihood learning framework, in which we use an English parser and parallel text to estimate the “transferring distribution” of the target language parsing model (See Section 2.2 for more details). Another advantage of the learning framework is that it combines both the likelihood on parallel data and confidence on unlabeled data, so that both parallel text and unlabeled data can be utilized in our approach.

Figure 1: An example dependency tree.

2.1 Edge-Factored Parsing Model

In this paper, we will use the following notation: 𝒙 represents a generic input sentence, and 𝒚 represents a generic dependency tree. T(𝒙) is used to denote the set of possible dependency trees for sentence 𝒙. The probabilistic model for dependency parsing defines a family of conditional probability pλ(𝒚|𝒙) over all 𝒚 given sentence 𝒙, with a log-linear form:

pλ(𝒚|𝒙)=1Z(𝒙)exp{jλjFj(𝒚,𝒙)} (1)

where Fj are feature functions, λ=(λ1,λ2,) are parameters of the model, and Z(𝒙) is a normalization factor, which is commonly referred to as the partition function:

Z(𝒙)=𝒚T(𝒙)exp{jλjFj(𝒚,𝒙)} (2)

A common strategy to make this parsing model efficiently computable is to factor dependency trees into sets of edges:

Fj(𝒚,𝒙)=eyfj(e,𝒙). (3)

That is, dependency tree y is treated as a set of edges e and each feature function Fj(𝒚,𝒙) is equal to the sum of all the features fj(e,𝒙).

We denote the weight function of each edge e as follows:

w(e,𝒙)=exp{jλjfj(e,𝒙)} (4)

and the conditional probability pλ(𝒚|𝒙) has the following form:

pλ(𝒚|𝒙)=1Z(𝒙)eyw(e,𝒙) (5)

2.2 Model Training

One of the most common model training methods for supervised dependency parser is Maximum conditional likelihood estimation. For a supervised dependency parser with a set of training data {(𝒙i,𝒚i)}, the logarithm of the likelihood (a.k.a. the log-likelihood) is given by:

L(λ)=ilogpλ(𝒚i|𝒙i) (6)

Maximum likelihood training chooses parameters such that the log-likelihood L(λ) is maximized.

However, in our scenario we have no labeled training data for target languages but we have some parallel and unlabeled data plus an English dependency parser. For the purpose of transferring cross-lingual information from the English parser via parallel text, we explore the model training method proposed by Smith and Eisner [44], which presented a generalization of K function [1], and related it to another semi-supervised learning technique, entropy regularization [20, 28]. The objective K function to be minimized is actually the expected negative log-likelihood:

K = -i𝒚ip~(𝒚i|𝒙i)logpλ(𝒚i|𝒙i) (7)
= iD(p~i||pλ,i)+H(p~i)

where p~i()=defp~(|𝒙i) and pλ,i()=defpλ(|𝒙i). p~(𝒚|𝒙) is the “transferring distribution” that reflects our uncertainty about the true labels, and we are trying to learn a parametric model pλ(𝒚|𝒙) by minimizing the K function.

In our scenario, we have a set of aligned parallel data P={𝒙is,𝒙it,ai} where ai is the word alignment for the pair of source-target sentences (𝒙is,𝒙it), and a set of unlabeled sentences of the target language U={𝒙it}. We also have a trained English parsing model pλE(𝒚|𝒙). Then the K in equation (7) can be divided into two cases, according to whether 𝒙i belongs to parallel data set P or unlabeled data set U. For the unlabeled examples {𝒙iU}, some previous studies (e.g., [1]) simply use a uniform distribution over labels (e.g., parses), to reflect that the label is unknown. We follow the method in Smith and Eisner [44] and take the transferring distribution p~i to be the actual current belief pλ,i. The total contribution of the unsupervised examples to K then simplifies to KU=𝒙iUH(pλ,i), which may be regarded as the entropy item used to constrain the model’s uncertainty H to be low, as presented in the work on entropy regularization [20, 28].

But how can we define the transferring distribution for the parallel examples {𝒙itP}? We define the transferring distribution by defining the transferring weight utilizing the English parsing model pλE(𝒚|𝒙) via parallel data with word alignments:

w~(et,𝒙it)={wE(es,𝒙is),if etaligneswE(edelext,𝒙is),otherwise (8)

where wE(,) is the weight function of the English parsing model pλE(𝒚|𝒙), and edelext is the delexicalized form22The delexicalized form of an edge is an edge for which only delexicalized features are considered. of the edge et. From the definition of the transferring weight, we can see that, if an edge et of the target language sentence 𝒙it is aligned to an edge es of the English sentence 𝒙is, we transfer the weight of edge et to the corresponding weight of edge es in the English parsing model pλE(𝒚|𝒙). If the edge et is not aligned to any edges of the English sentence 𝒙is, we reduce the edge et to the delexicalized form and calculate the transferring weight in the English parsing model. There are two advantages for this definition of the transferring weight. First, by transferring the weight function to the corresponding weight in the well-developed English parsing model, we can project syntactic information across language boundaries. Second, McDonald et al. [34] demonstrates that parsers with only delexicalized features produce considerably high parsing performance. By reducing unaligned edges to their delexicalized forms, we can still use those delexicalized features, such as part-of-speech tags, for those unaligned edges, and can address problem that automatically generated word alignments include errors.

From the definition of transferring weight in equation (8), the transferring distribution can be defined in the following way:

p~(𝒚|𝒙)=1Z~(𝒙)eyw~(e,𝒙) (9)

where

Z~(𝒙)=yeyw~(e,𝒙) (10)

Due to the normalizing factor Z~(𝒙), the transferring distribution is a valid one.

We introduce a multiplier γ as a trade-off between the two contributions (parallel and unsupervised) of the objective function K, and the final objective function K has the following form:

K = -xiPyip~(𝒚i|𝒙i)logpλ(𝒚i|𝒙i) (11)
+γxiUH(pλ,i)
= KP+γKU

KP and KU are the contributions of the parallel and unsupervised data, respectively. One may regard γ as a Lagrange multiplier that is used to constrain the parser’s uncertainty H to be low, as presented in several studies on entropy regularization [5, 17, 20].

2.3 Algorithms and Complexity for Model Training

To train our parsing model, we need to find out the parameters λ that minimize the objective function K in equation (11). This optimization problem is typically solved using quasi-Newton numerical methods such as L-BFGS [36], which requires efficient calculation of the objective function and the gradient of the objective function.

The first item (KP) of the K function in equation (11) can be rewritten in the following form:

KP = -𝒙iP[yip~(𝒚i|𝒙i)e𝒚ilogw(e,𝒙i) (12)
-logZ(𝒙i)]

and according to equation (1) and (3) the gradient of KP can be written as:

KPλj = xiPp~(𝒚i|𝒙i)logpλ(𝒚i|𝒙i)λj (13)
= xiP[yip~(𝒚i|𝒙i)eyifj(e,𝒙i)
-yipλ(𝒚i|𝒙i)eyifj(e,𝒙i)]

According to equation (9), p~(𝒚|𝒙) can also be factored into the multiplication of the weight of each edge, so both KP and its gradient can be calculated by running the O(n3) inside-outside algorithm [2, 41] for projective parsing. For non-projective parsing, the analogy to the inside algorithm is the O(n3) matrix-tree algorithm based on Kirchhoff’s Matrix-Tree Theorem, which is dominated asymptotically by a matrix determinant [25, 46]. The gradient of a determinant may be computed by matrix inversion, so evaluating the gradient again has the same O(n3) complexity as evaluating the function.

The second item (KU) of the K function in equation (11) is the Shannon entropy of the posterior distribution over parsing trees, and can be written into the following form:

KU = -𝒙iU[yipλ(𝒚i|𝒙i)e𝒚ilogw(e,𝒙i) (14)
-logZ(𝒙i)]

and the gradient of KU is in the following:

KUλj = xiUpλ(𝒚i|𝒙i)logpλ(𝒚i|𝒙i)λj (15)
= -yipλ(𝒚i|𝒙i)logpλ(𝒚i|𝒙i)Fj(𝒚i,𝒙i)
+ (yipλ(𝒚i|𝒙i)logpλ(𝒚i|𝒙i))
(yipλ(𝒚i|𝒙i)Fj(𝒚i,𝒙i))

Similar with the calculation of KP, KU can also be computed by running the inside-outside algorithm [2, 41] for projective parsing. For the gradient of KU, both the two multipliers of the second item in equation (15) can be computed using the same inside-outside algorithm. For the first item in equation (15), an O(n3) dynamic programming algorithm that is closely related to the forward-backward algorithm [28] for the entropy regularized CRF [20] can be used for projective parsing. For non-projective parsing, however, the runtime rises to O(n4). In this paper, we focus on projective parsing.

2.4 Summary of Our Approach

To summarize the description in the previous sections, our approach is performed in the following steps:

  1. 1.

    Train an English parsing model pλE(𝒚|𝒙), which is used to estimate the transferring distribution p~(𝒚|𝒙).

  2. 2.

    Prepare parallel text by running word alignment method to obtain word alignments,33The word alignment methods do not require additional resources besides parallel text. and prepare the unlabeled data.

  3. 3.

    Train a parsing model for the target language by minimizing the objective K function which is the combination of expected negative log-likelihood on parallel and unlabeled data.

#sents/#tokens
training dev test
Version 1.0
de 2,200/30,460 800/12,215 1,000/16,339
es 3,345/94,232 370/10,191 300/8,295
fr 3,312/74,979 366/8,071 300/6,950
ko 5,308/62,378 588/6,545 298/2,917
sv 4,447/66,631 493/9,312 1,219/20,376
Version 2.0
de 14,118/26,4906 800/12,215 1,000/16,339
es 14,138/37,5180 1,569/40,950 300/8,295
fr 14,511/35,1233 1,611/38,328 300/6,950
it 6,389/14,9145 400/9,541 400/9,187
ko 5437/60,621 603/6,438 299/2,631
pt 9,600/23,9012 1,200/29,873 1,198/29,438
sv 4,447/66,631 493/9,312 1,219/20,376
Table 1: Data statistics of two versions of Google Universal Treebanks for the target languages.

3 Data and Tools

In this section, we illustrate the data sets used in our experiments and the tools for data preparation.

# sents
500 1000 2000 5000 10000 20000
da 12,568 25,225 49,889 126,623 254,565 509,480
de 13,548 26,663 53,170 133,596 265,589 527,407
el 14,198 28,302 56,744 143,753 286,126 572,777
es 15,147 29,214 57,526 144,621 290,517 579,164
fr 15,046 29,982 60,569 153,874 306,332 609,541
it 15,151 29,786 57,696 145,717 288,337 573,557
ko 3,814 7,679 15,337 38,535 77,388 155,051
nl 13,234 26,777 54,570 137,277 274,692 551,463
pt 14,346 28,109 55,998 143,221 285,590 571,109
sv 12,242 24,897 50,047 123,069 246,619 490,086
Table 2: The number of tokens in parallel data used in our experiments. For all these corpora, the other language is English.

3.1 Choosing Target Languages

Our experiments rely on two kinds of data sets: (i) Monolingual Treebanks with consistent annotation schema — English treebank is used to train the English parsing model, and the Treebanks for target languages are used to evaluate the parsing performance of our approach. (ii) Large amounts of parallel text with English on one side. We select target languages based on the availability of these resources. The monolingual treebanks in our experiments are from the Google Universal Dependency Treebanks [31], for the reason that the treebanks of different languages in Google Universal Dependency Treebanks have consistent syntactic representations.

The parallel data come from the Europarl corpus version 7 [23] and Kaist Corpus44http://semanticweb.kaist.ac.kr/home/index.php/Corpus10. Taking the intersection of languages in the two kinds of resources yields the following seven languages: French, German, Italian, Korean, Portuguese, Spanish and Swedish.

The treebanks from CoNLL shared-tasks on dependency parsing [6, 38] appear to be another reasonable choice. However, previous studies [34, 31] have demonstrated that a homogeneous representation is critical for multilingual language technologies that require consistent cross-lingual analysis for downstream components, and the heterogenous representations used in CoNLL shared-tasks treebanks weaken any conclusion that can be drawn.

For comparison with previous studies, nevertheless, we also run experiments on CoNLL treebanks (see Section 4.4 for more details). We evaluate our approach on three target languages from CoNLL shared task treebanks, which do not appear in Google Universal Treebanks. The three languages are Danish, Dutch and Greek. So totally we have ten target languages. The parallel data for these three languages are also from the Europarl corpus version 7.

3.2 Word Alignments

In our approach, word alignments for the parallel text are required. We perform word alignments with the open source GIZA++ toolkit55https://code.google.com/p/giza-pp/. The parallel corpus was preprocessed in standard ways, selecting sentences with the length in the range from 3 to 100. Then we run GIZA++ with the default setting to generate word alignments in both directions. We then make the intersection of the word alignments of two directions to generate one-to-one alignments.

DTP DTP† PTP† -U +U OR
de 58.50 58.46 69.21 73.72 74.01 78.64
es 68.07 68.72 72.57 75.32 75.60 82.56
fr 70.14 71.13 74.60 76.65 76.93 83.69
ko 42.37 43.57 53.72 59.72 59.94 89.85
sv 70.56 70.59 75.87 78.91 79.27 85.59
Ave 61.93 62.49 69.19 72.86 73.15 84.67
Table 3: UAS for two versions of our approach, together with baseline and oracle systems on Google Universal Treebanks version 1.0. “Ave” is the macro-average across the five languages.

3.3 Part-of-Speech Tagging

Several features in our parsing model involve part-of-speech (POS) tags of the input sentences. The set of POS tags needs to be consistent across languages and treebanks. For this reason we use the universal POS tag set of Petrov et al. [42]. This set consists of the following 12 coarse-grained tags: NOUN (nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), PRON (pronouns), DET (determiners), ADP (prepositions or postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), PUNC (punctuation marks) and X (a catch-all for other categories such as abbreviations or foreign words).

POS tags are not available for parallel data in the Europarl and Kaist corpus, so we need to provide the POS tags for these data. In our experiments, we train a Stanford POS Tagger [50] for each language. The labeled training data for each POS tagger are extracted from the training portion of each Treebanks. The average tagging accuracy is around 95%.

Undoubtedly, we are primarily interested in applying our approach to build statistical parsers for resource-poor target languages without any knowledge. For the purpose of evaluation of our approach and comparison with previous work, we need to exploit the gold POS tags to train the POS taggers. As part-of-speech tags are also a form of syntactic analysis, this assumption weakens the applicability of our approach. Fortunately, some recently proposed POS taggers, such as the POS tagger of Das and Petrov [13], rely only on labeled training data for English and the same kind of parallel text in our approach. In practice we can use this kind of POS taggers to predict POS tags, whose tagging accuracy is around 85%.

DTP† PTP† -U +U OR
de 58.56 69.77 73.92 74.30 81.65
es 68.72 73.22 75.21 75.53 83.92
fr 71.13 74.75 76.14 76.53 83.51
it 70.74 76.08 77.55 77.74 85.47
ko 38.55 43.34 59.71 59.89 90.42
pt 69.82 74.59 76.30 76.65 85.67
sv 70.59 75.87 78.91 79.27 85.59
Ave 64.02 69.66 73.96 74.27 85.18
Table 4: UAS for two versions of our approach, together with baseline and oracle systems on Google Universal Treebanks version 2.0. “Ave” is the macro-average across the seven languages.

4 Experiments

In this section, we will describe the details of our experiments and compare our results with previous methods.

Google Universal Treebanks V1.0
de es fr ko sv
# sents PTP† -U +U PTP† -U +U PTP† -U +U PTP† -U +U PTP† -U +U
500 63.23 70.79 70.93 70.09 72.32 72.64 72.24 74.64 74.90 47.71 56.87 57.22 71.70 75.88 76.13
1000 65.61 71.71 71.86 70.90 73.44 73.67 72.95 75.07 75.35 47.83 57.65 58.15 72.38 76.55 77.03
2000 66.52 72.33 72.48 72.01 73.57 73.81 73.69 75.88 76.22 48.37 58.19 58.44 73.65 77.86 78.12
5000 67.79 73.06 73.31 72.34 74.30 74.79 74.31 76.02 76.29 53.02 58.57 59.04 74.88 78.48 78.70
10000 68.44 73.59 73.92 72.48 74.86 75.26 74.43 76.14 76.34 53.61 59.17 59.55 75.34 78.78 79.08
20000 69.21 73.72 74.01 72.57 75.32 75.60 74.60 76.55 76.93 53.72 59.72 59.94 75.87 78.91 79.27
Google Universal Treebanks V2.0
de es fr ko it
# sents PTP† -U +U PTP† -U +U PTP† -U +U PTP† -U +U PTP† -U +U
500 60.10 71.07 71.39 69.52 72.97 73.28 71.10 74.57 74.70 40.09 56.60 57.10 72.80 75.67 75.94
1000 61.76 72.15 72.39 70.78 73.48 73.79 72.14 75.13 75.43 40.44 57.55 57.93 73.55 76.43 76.67
2000 65.35 72.73 73.04 71.75 74.10 74.35 73.21 75.78 76.06 40.87 58.11 58.43 74.44 76.99 77.39
5000 67.86 73.32 73.62 72.43 74.55 74.83 74.14 75.83 76.02 40.90 58.48 58.96 75.07 77.10 77.34
10000 68.70 73.71 74.02 72.85 74.80 74.95 74.53 75.97 76.17 41.29 59.13 59.44 75.65 77.50 77.71
20000 69.77 73.92 74.30 73.22 75.21 75.53 74.75 76.14 76.53 43.34 59.71 59.89 76.08 77.55 77.74
pt
# sents PTP† -U +U
500 71.34 74.41 74.68
1000 71.91 74.48 75.08
2000 72.93 75.10 75.32
5000 73.78 75.88 75.98
10000 74.40 75.99 76.15
20000 74.59 76.30 76.65
Table 5: Parsing results of our approach with different amount of parallel data on Google Universal Treebanks version 1.0 and 2.0. We omit the results of Swedish for treebanks version 2.0 since the data for Swedish from version 2.0 are exactly the same with those from version 1.0.

4.1 Data Sets

As presented in Section 3.1, we evaluate our parsing approach on both version 1.0 and version 2.0 of Google Univereal Treebanks for seven languages66Japanese and Indonesia are excluded as no practicable parallel data are available.. We use the standard splits of the treebank for each language as specified in the release of the data77https://code.google.com/p/uni-dep-tb/. Table 1 presents the statistics of the two versions of Google Universal Treebanks. We strip all the dependency annotations off the training portion of each treebank, and use that as the unlabeled data for that target language. We train our parsing model with different numbers of parallel sentences to analyze the influence of the amount of parallel data on the parsing performance of our approach. The parallel data sets contain 500, 1000, 2000, 5000, 10000 and 20000 parallel sentences, respectively. We randomly extract parallel sentences from each corpora, and smaller data sets are subsets of larger ones. Table 2 shows the number of tokens in the parallel data used in the experiments.

4.2 System performance and comparison
on Google Universal Treebanks

For the comparison of parsing performance, we run experiments on the following systems:

DTP:

The direct transfer parser (DTP) proposed by McDonald et al. [34], who train a delexicalized parser on English labeled training data with no lexical features, then apply this parser to parse target languages directly. It is based on the transition-based dependency parsing paradigm [40]. We directly cite the results reported in McDonald et al. [31]. In addition to their original results, we also report results by re-implementing the direct transfer parser based on the first-order projective dependency parsing model [30] (DTP†).

PTP

The projected transfer parser (PTP) described in McDonald et al. [34]. The results of the projected transfer parser re-implemented by us is marked as “PTP†”.

-U:

Our approach training on only parallel data without unlabeled data for the target language. The parallel data set for each language contains 20,000 sentences.

+U:

Our approach training on both parallel and unlabeled data. The parallel data sets are the ones contains 20,000 sentences.

OR:

the supervised first-order projective dependency parsing model [30], trained on the original treebanks with maximum likelihood estimation (equation 6). One may regard this system as an oracle of transfer parsing.

Parsing accuracy is measured with unlabeled attachment score (UAS): the percentage of words with the correct head.

Table 3 and Table 4 shows the parsing results of our approach, together with the results of the baseline systems and the oracle, on version 1.0 and version 2.0 of Google Universal Treebanks, respectively. Our approaches significantly outperform all the baseline systems across all the seven target languages. For the results on Google Universal Treebanks version 1.0, the improvement on average over the projected transfer paper (PTP†) is 3.96% and up to 6.22% for Korean and 4.80% for German. For the other three languages, the improvements are remarkable, too — 2.33% for French, 3.03% for Spanish and 3.40% for Swedish. By adding entropy regularization from unlabeled data, our full model achieves average improvement of 0.29% over the “-U” setting. Moreover, our approach considerably bridges the gap to fully supervised dependency parsers, whose average UAS is 84.67%. For the results on treebanks version 2.0, we can get similar observation and draw the same conclusion.

4.3 Effect of the Amount of Parallel Text

Table 5 illustrates the UAS of our approach trained on different amounts of parallel data, together with the results of the projected transfer parser re-implemented by us (PTP†). We run two versions of our approach for each of the parallel data sets, one with unlabeled data (+U) and the other without them (-U). From table 5 we can get three observations. First, even the parsers trained with only 500 parallel sentences achieve considerably high parsing accuracies (average 70.10% for version 1.0 and 71.59% for version 2.0). This demonstrates that our approach does not rely on a large amount of parallel data. Second, when gradually increasing the amount of parallel data, the parsing performance continues improving. Third, entropy regularization with unlabeled data makes modest improvement on parsing performance over the parsers without unlabeled data. This proves the effectiveness of the entropy regularization from unlabeled data.

4.4 Experiments on CoNLL Treebanks

To make a thorough empirical comparison with previous studies, we also evaluate our system without unlabeled data (-U) on treebanks from CoNLL shared task on dependency parsing [6, 38]. To facilitate comparison, we use the same eight Indo-European languages as target languages: Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish, and same experimental setup as McDonald et al. [34]. We report both the results of the direct transfer and projected transfer parsers directly cited from McDonald et al. [34] (DTP and PTP) and re-implemented by us (DTP†and PTP†).

Table 6 gives the results comparing the model without unlabeled data (-U) presented in this work to those five baseline systems and the oracle (OR). The results of unsupervised DMV model [22] are from Table 1 of McDonald et al. [34]. Our approach outperforms all these baseline systems and achieves state-of-the-art performance on all the eight languages.

In order to compare with more previous methods, we also report parsing performance on sentences of length 10 or less after punctuation has been removed. Table 7 shows the results of our system and the results of baseline systems. “USR†” is the weakly supervised system of Naseem et al. [35]. “PGI” is the phylogenetic grammar induction model of Berg-Kirkpatrick and Klein [3]. Both the results of the two systems are cited from Table 4 of McDonald et al. [34]. We also include the results of the unsupervised dependency parsing model with non-parallel multilingual guidance (NMG) proposed by Cohen et al. [11]88For each language, we use the best result of the four systems in Table 3 of Cohen et al. [11], and “PR” which is the posterior regularization approach presented in Gillenwater et al. [15]. All the results are shown in Table 7.

From Table 7, we can see that among the eight target languages, our approach achieves best parsing performance on six languages — Danish, German, Greek, Italian, Portuguese and Swedish. It should be noted that the “NMG” system utilizes more than one helper languages. So it is not directly comparable to our work.

DMV DTP DTP† PTP PTP† -U OR
da 33.4 45.9 46.8 48.2 50.0 50.1 87.1
de 18.0 47.2 46.0 50.9 52.4 57.3 87.0
el 39.9 63.9 62.9 66.8 65.3 67.4 82.3
es 28.5 53.3 54.4 55.8 59.9 60.3 83.6
it 43.1 57.7 59.9 60.8 63.4 64.0 83.9
nl 38.5 60.8 60.7 67.8 66.5 68.2 78.2
pt 20.1 69.2 71.1 71.3 74.8 75.1 87.2
sv 44.0 58.3 60.3 61.3 62.8 66.7 88.0
Ave 33.2 57.0 57.8 60.4 61.9 63.6 84.7
Table 6: Parsing results on treebanks from CoNLL shared tasks for eight target languages. The results of unsupervised DMV model are from Table 1 of McDonald et al. [34].
DTP DTP† PTP PTP† USR† PGI PR NMG -U
da 53.2 55.3 57.4 59.8 55.1 41.6 44.0 59.9 60.1
de 65.9 57.9 67.0 63.5 60.0 67.5
el 73.9 70.8 73.9 72.3 60.3 73.0 74.3
es 58.0 62.3 62.3 66.1 68.3 58.4 62.4 76.7 64.6
it 65.5 66.9 69.9 71.5 47.9 73.6
nl 67.6 66.0 72.2 72.1 44.0 45.1 37.9 50.7 70.5
pt 77.9 79.2 80.6 82.9 70.9 63.0 47.8 79.8 83.3
sv 70.4 70.2 71.3 70.4 52.6 58.3 42.2 74.0 75.1
Ave 66.6 66.1 69.4 69.8 57.4 71.1
Table 7: UAS on sentences of length 10 or less without punctuation from CoNLL shared task treebanks. “USR†” is the weakly supervised system of Naseem et al. [35]. “PGI” is the phylogenetic grammar induction model of Berg-Kirkpatrick and Klein [3]. Both the “USR†” and “PGI” systems are implemented and reported by McDonald et al. [34]. “NMG” is the unsupervised dependency parsing model with non-parallel multilingual guidance [11]. “PR” is the posterior regularization approach presented in Gillenwater et al. [15]. Some systems’ results for certain target languages are not available as marked by —.

4.5 Extensions

In this section, we briefly outline a few extensions to our approach that we want to explore in future work.

4.5.1 Non-Projective Parsing

As mentioned in section 2.3, the runtime to compute KU and its gradient is O(n4). One reasonable speedup, as presented in Smith and Eisner [44], is to replace Shannon entropy with Rényi entropy. The Rényi entropy is parameterized by α:

Rα(p)=11-αlog(yp(y)α) (16)

With Rényi entropy, the computation of KU and its gradient is O(n3), even for non-projective case.

4.5.2 Higher-Order Models for Projective Parsing

Our learning framework can be extended to higher-order dependency parsing models. For example, if we want to make our model capable of utilizing more contextual information, we can extend our transferring weight to higher-order parts:

w~(pt,𝒙it)={wE(ps,𝒙is),if ptalignpswE(pdelext,𝒙is),otherwise (17)

where p is a small part of tree 𝒚 that has limited interactions. For projective parsing, several algorithms [33, 8, 24, 27] have been proposed to solve the model training problems (calculation of objective function and gradient) for different factorizations.

4.5.3 IGT Data

One possible direction to improve our approach is to replace parallel text with Interlinear Glossed Text (IGT) [26], which is a semi-structured data type encoding more syntactic information than parallel data. By using IGT Data, not only can we obtain more accurate word alignments, but also extract useful cross-lingual information for the resource-poor language.

5 Conclusion

In this paper, we propose an unsupervised projective dependency parsing approach for resource-poor languages, using existing resources from a resource-rich source language. By presenting a model training framework, our approach can utilize parallel text to estimate transferring distribution with the help of a well-developed resource-rich language dependency parser, and use unlabeled data as entropy regularization. The experimental results on three data sets across ten target languages show that our approach achieves significant improvement over previous studies.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. BCS-0748919. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

  • [1] S. Abney(2004) Understanding the Yarowsky algorithm. Computational Linguistics 30, pp. 2004. Cited by: 2.2, 2.2.
  • [2] J. K. Baker(1979) Trainable grammars for speech recognition. pp. 547–550. Cited by: 2.3, 2.3.
  • [3] T. Berg-Kirkpatrick and D. Klein(2010-07) Phylogenetic grammar induction. Uppsala, Sweden, pp. 1288–1297. External Links: Link Cited by: 4.4, 7.
  • [4] P. Blunsom and T. Cohn(2010-10) Unsupervised induction of tree substitution grammars for dependency parsing. Cambridge, MA, pp. 1204–1213. External Links: Link Cited by: 1.
  • [5] M. Brand(1998) Structure learning in conditional probability models via an entropic prior and parameter extinction. Neural Computation 11 (5), pp. 1155–1182. Cited by: 2.2.
  • [6] S. Buchholz and E. Marsi(2006) CoNLL-X shared task on multilingual dependency parsing. New York, NY, pp. 149–164. Cited by: 3.1, 4.4.
  • [7] D. Burkett and D. Klein(2008-10) Two languages are better than one (for syntactic parsing). Honolulu, Hawaii, pp. 877–886. External Links: Link Cited by: 1.
  • [8] X. Carreras(2007) Experiments with a higher-order projective dependency parser. pp. 957–961. Cited by: 1, 4.5.2.
  • [9] G. Carroll and E. Charniak(1992) Two experiments on learning probabilistic dependency grammars from corpora. Cited by: 1.
  • [10] W. Chen, J. Kazama and K. Torisawa(2010-07) Bitext dependency parsing with bilingual subtree constraints. Uppsala, Sweden, pp. 21–29. External Links: Link Cited by: 1.
  • [11] S. B. Cohen, D. Das and N. A. Smith(2011-07) Unsupervised structure prediction with non-parallel multilingual guidance. Edinburgh, Scotland, UK., pp. 50–61. External Links: Link Cited by: 1, 4.4, 7.
  • [12] S. Cohen and N. A. Smith(2009-06) Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. Boulder, Colorado, pp. 74–82. External Links: Link Cited by: 1.
  • [13] D. Das and S. Petrov(2011-06) Unsupervised part-of-speech tagging with bilingual graph-based projections. Portland, Oregon, USA, pp. 600–609. External Links: Link Cited by: 3.3.
  • [14] K. Ganchev, J. Gillenwater and B. Taskar(2009-08) Dependency grammar induction via bitext projection constraints. Suntec, Singapore, pp. 369–377. External Links: Link Cited by: 1.
  • [15] J. Gillenwater, K. Ganchev, J. Graça, F. Pereira and B. Taskar(2010-07) Sparsity in dependency grammar induction. Uppsala, Sweden, pp. 194–199. External Links: Link Cited by: 4.4, 7.
  • [16] J. V. Graca, L. Inesc-id, K. Ganchev and B. Taskar(2007) Expectation maximization and posterior constraints. pp. 569–576. Cited by: 1.
  • [17] Y. Grandvalet and Y. Bengio(2004) Semi-supervised learning by entropy minimization. Cited by: 2.2.
  • [18] L. Huang, W. Jiang and Q. Liu(2009-08) Bilingually-constrained (monolingual) shift-reduce parsing. Singapore, pp. 1222–1231. External Links: Link Cited by: 1.
  • [19] R. Hwa, P. Resnik, A. Weinberg, C. Cabezas and O. Kolak(2005) Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering 11, pp. 11–311. Cited by: 1.
  • [20] F. Jiao, S. Wang, C. Lee, R. Greiner and D. Schuurmans(2006-07) Semi-supervised conditional random fields for improved sequence segmentation and labeling. Sydney, Australia, pp. 209–216. External Links: Link, Document Cited by: 2.2, 2.2, 2.2, 2.3.
  • [21] J. Katz-Brown, S. Petrov, R. McDonald, F. Och, D. Talbot, H. Ichikawa, M. Seno and H. Kazawa(2011-07) Training a parser for machine translation reordering. Edinburgh, Scotland, UK., pp. 183–192. External Links: Link Cited by: 1.
  • [22] D. Klein and C. Manning(2004-07) Corpus-based induction of syntactic structure: models of dependency and constituency. Barcelona, Spain, pp. 478–485. External Links: Link, Document Cited by: 1, 4.4.
  • [23] P. Koehn(2005) Europarl: A Parallel Corpus for Statistical Machine Translation. Phuket, Thailand, pp. 79–86. External Links: Link Cited by: 3.1.
  • [24] T. Koo and M. Collins(2010-07) Efficient third-order dependency parsers. Uppsala, Sweden, pp. 1–11. Cited by: 1, 4.5.2.
  • [25] T. Koo, A. Globerson, X. Carreras and M. Collins(2007-06) Structured predicition models via the matrix-tree theorem. Prague, Czech, pp. 141–150. Cited by: 2.3.
  • [26] W. D. Lewis and F. Xia(2010) Developing odin: a multilingual repository of annotated language data for hundreds of the world’s languages.. LLC 25 (3), pp. 303–319. External Links: Link Cited by: 4.5.3.
  • [27] X. Ma and H. Zhao(2012-12) Fourth-order dependency parsing. Mumbai, India, pp. 785–796. External Links: Link Cited by: 1, 4.5.2.
  • [28] G. S. Mann and A. McCallum(2007) Efficient computation of entropy gradient for semi-supervised conditional random fields. Stroudsburg, PA, USA, pp. 109–112. External Links: Link Cited by: 2.2, 2.2, 2.3.
  • [29] D. Mareček and M. Straka(2013-08) Stop-probability estimates computed on a large corpus improve unsupervised dependency parsing. Sofia, Bulgaria, pp. 281–290. External Links: Link Cited by: 1.
  • [30] R. McDonald, K. Crammer and F. Pereira(2005-June 25-30) Online large-margin training of dependency parsers. Ann Arbor, Michigan, USA, pp. 91–98. Cited by: DTP:, OR:, 1.
  • [31] R. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das, K. Ganchev, K. Hall, S. Petrov, H. Zhang, O. Täckström, C. Bedini, N. Bertomeu Castelló and J. Lee(2013-08) Universal dependency annotation for multilingual parsing. Sofia, Bulgaria, pp. 92–97. External Links: Link Cited by: DTP:, 1, 3.1, 3.1.
  • [32] R. McDonald, F. Pereira, K. Ribarov and J. Hajic(2005-10) Non-projective dependency parsing using spanning tree algorithms. Vancouver, Canada, pp. 523–530. Cited by: 1.
  • [33] R. McDonald and F. Pereira(2006-04) Online learning of approximate dependency parsing algorithms. Trento, Italy, pp. 81–88. Cited by: 1, 4.5.2.
  • [34] R. McDonald, S. Petrov and K. Hall(2011-07) Multi-source transfer of delexicalized dependency parsers. Edinburgh, Scotland, UK., pp. 62–72. External Links: Link Cited by: DTP:, PTP, 1, 1, 2.2, 3.1, 4.4, 4.4, 4.4, 6, 7.
  • [35] T. Naseem, H. Chen, R. Barzilay and M. Johnson(2010-10) Using universal linguistic knowledge to guide grammar induction. Cambridge, MA, pp. 1234–1244. External Links: Link Cited by: 4.4, 7.
  • [36] S. G. Nash and J. Nocedal(1991) A numerical study of the limited memory bfgs method and truncated-newton method for large scale optimization. SIAM Journal on Optimization 1(2), pp. 358–372. Cited by: 2.3.
  • [37] T. T. Nguyen, A. Moschitti and G. Riccardi(2009-08) Convolution kernels on constituent, dependency and sequential structures for relation extraction. Singapore, pp. 1378–1387. External Links: Link Cited by: 1.
  • [38] J. Nivre, J. Hall, S. Kübler, R. Mcdonald, J. Nilsson, S. Riedel and D. Yuret(2007) The conll 2007 shared task on dependency parsing. Prague, Czech, pp. 915–932. Cited by: 3.1, 4.4.
  • [39] J. Nivre and M. Scholz(2004-August 23-27) Deterministic dependency parsing of English text. Geneva, Switzerland, pp. 64–70. Cited by: 1.
  • [40] J. Nivre(2008-12) Algorithms for deterministic incremental dependency parsing. Comput. Linguist. 34 (4), pp. 513–553. External Links: ISSN 0891-2017, Link, Document Cited by: DTP:.
  • [41] M. A. Paskin(2001) Cubic-time parsing and learning algorithms for grammatical bigram models. Cited by: 2.3, 2.3.
  • [42] S. Petrov, D. Das and R. T. McDonald(2011) A universal part-of-speech tagset. CoRR abs/1104.2086. Cited by: 3.3.
  • [43] Y. Shinyama, S. Sekine and K. Sudo(2002) Automatic paraphrase acquisition from news articles. pp. 313–318. Cited by: 1.
  • [44] D. A. Smith and J. Eisner(2007-06) Bootstrapping feature-rich dependency parsers with entropic priors. Prague, Czech Republic, pp. 667–677. External Links: Link Cited by: 1, 2.2, 2.2, 4.5.1.
  • [45] D. A. Smith and N. A. Smith(2004) Bilingual parsing with factored estimation: using English to parse Korean. pp. 49–56. Cited by: 1.
  • [46] D. A. Smith and N. A. Smith(2007-06) Probabilistic models of nonporjective dependency trees. Prague, Czech, pp. 132–140. Cited by: 2.3.
  • [47] N. A. Smith and J. Eisner(2005-06) Contrastive estimation: training log-linear models on unlabeled data. Ann Arbor, Michigan, pp. 354–362. External Links: Link, Document Cited by: 1.
  • [48] V. I. Spitkovsky, H. Alshawi and D. Jurafsky(2010-06) From baby steps to leapfrog: how “less is more” in unsupervised dependency parsing. Los Angeles, California, pp. 751–759. External Links: Link Cited by: 1.
  • [49] V. I. Spitkovsky, H. Alshawi and D. Jurafsky(2013-10) Breaking out of local optima with count transforms and model recombination: a study in grammar induction. Seattle, Washington, USA, pp. 1983–1995. External Links: Link Cited by: 1.
  • [50] K. Toutanova, D. Klein, C. D. Manning and Y. Singer(2003) Feature-rich part-of-speech tagging with a cyclic dependency network. pp. 252–259. Cited by: 3.3.
  • [51] J. Xie, H. Mi and Q. Liu(2011-07) A novel dependency-to-string model for statistical machine translation. Edinburgh, Scotland, UK., pp. 216–226. External Links: Link Cited by: 1.
  • [52] H. Zhang, L. Huang, K. Zhao and R. McDonald(2013-10) Online learning for inexact hypergraph search. Seattle, Washington, USA, pp. 908–913. External Links: Link Cited by: 1.