We present a novel approach for inducing unsupervised dependency parsers for languages that have no labeled training data, but have translated text in a resource-rich language. We train probabilistic parsing models for resource-poor languages by transferring cross-lingual knowledge from resource-rich language with entropy regularization. Our method can be used as a purely monolingual dependency parser, requiring no human translations for the test data, thus making it applicable to a wide range of resource-poor languages. We perform experiments on three Data sets — Version 1.0 and version 2.0 of Google Universal Dependency Treebanks and Treebanks from CoNLL shared-tasks, across ten languages. We obtain state-of-the art performance of all the three data sets when compared with previously studied unsupervised and projected parsing systems.
In recent years, dependency parsing has gained universal interest due to its usefulness in a wide range of applications such as synonym generation [43], relation extraction [37] and machine translation [21, 51]. Several supervised dependency parsing algorithms [39, 30, 32, 33, 8, 24, 27, 52] have been proposed and achieved high parsing accuracies on several treebanks, due in large part to the availability of dependency treebanks in a number of languages [31]. However, the manually annotated treebanks that these parsers rely on are highly expensive to create, in particular when we want to build treebanks for resource-poor languages. This led to a vast amount of research on unsupervised grammar induction [9, 22, 47, 12, 48, 4, 29, 49], which appears to be a natural solution to this problem, as unsupervised methods require only unannotated text for training parsers. Unfortunately, the unsupervised grammar induction systems’ parsing accuracies often significantly fall behind those of supervised systems [34]. Furthermore, from a practical standpoint, it is rarely the case that we are completely devoid of resources for most languages.
In this paper, we consider a practically motivated scenario, in which we want to build statistical parsers for resource-poor target languages, using existing resources from a resource-rich source language (like English).11For the sake of simplicity, we refer to the resource-poor language as the “target language”, and resource-rich language as the “source language”. In addition, in this study we use English as the source resource-rich language, but our methodology can be applied to any resource-rich languages. We assume that there are absolutely no labeled training data for the target language, but we have access to parallel data with a resource-rich language and a sufficient amount of labeled training data to build an accurate parser for the resource-rich language. This scenario appears similar to the setting in bilingual text parsing. However, most bilingual text parsing approaches require bilingual treebanks — treebanks that have manually annotated tree structures on both sides of source and target languages [45, 7], or have tree structures on the source side and translated sentences in the target languages [18, 10]. Obviously, bilingual treebanks are much more difficult to acquire than the resources required in our scenario, since the labeled training data and the parallel text in our case are completely separated. What is more important is that most studies on bilingual text parsing assumed that the parser is applied only on bilingual text. But our goal is to develop a parser that can be used in completely monolingual setting for each target language of interest.
This scenario is applicable to a large set of languages and many research studies [19] have been made on it. Ganchev et al. [14] presented a parser projection approach via parallel text using the posterior regularization framework [16]. McDonald et al. [34] proposed two parser transfer approaches between two different languages — one is directly transferred parser from delexicalized parsers, and the other parser is transferred using constraint driven learning algorithm where constraints are drawn from parallel corpora. In that work, they demonstrate that even the directly transferred delexicalized parser produces significantly higher accuracies than unsupervised parsers. Cohen et al. [11] proposed an approach for unsupervised dependency parsing with non-parallel multilingual guidance from one or more helper languages, in which parallel data is not used.
In this work, we propose a learning framework for transferring dependency grammars from a resource-rich language to resource-poor languages via parallel text. We train probabilistic parsing models for resource-poor languages by maximizing a combination of likelihood on parallel data and confidence on unlabeled data. Our work is based on the learning framework used in Smith and Eisner [44], which is originally designed for parser bootstrapping. We extend this learning framework so that it can be used to transfer cross-lingual knowledge between different languages.
Throughout this paper, English is used as the source language and we evaluate our approach on ten target languages — Danish (da), Dutch (nl), French (fr), German (de), Greek (el), Italian (it), Korean (ko), Portuguese (pt), Spanish (es) and Swedish (sv). Our approach achieves significant improvement over previous state-of-the-art unsupervised and projected parsing systems across all the ten languages, and considerably bridges the gap to fully supervised dependency parsing performance.
Dependency trees represent syntactic relationships through labeled directed edges between heads and their dependents. For example, Figure 1 shows a dependency tree for the sentence, Economic news had little effect on financial markets, with the sentence’s root-symbol as its root. The focus of this work is on building dependency parsers for target languages, assuming that an accurate English dependency parser and some parallel text between the two languages are available. Central to our approach is a maximizing likelihood learning framework, in which we use an English parser and parallel text to estimate the “transferring distribution” of the target language parsing model (See Section 2.2 for more details). Another advantage of the learning framework is that it combines both the likelihood on parallel data and confidence on unlabeled data, so that both parallel text and unlabeled data can be utilized in our approach.
In this paper, we will use the following notation: represents a generic input sentence, and represents a generic dependency tree. is used to denote the set of possible dependency trees for sentence . The probabilistic model for dependency parsing defines a family of conditional probability over all given sentence , with a log-linear form:
(1) |
where are feature functions, are parameters of the model, and is a normalization factor, which is commonly referred to as the partition function:
(2) |
A common strategy to make this parsing model efficiently computable is to factor dependency trees into sets of edges:
(3) |
That is, dependency tree is treated as a set of edges and each feature function is equal to the sum of all the features .
We denote the weight function of each edge as follows:
(4) |
and the conditional probability has the following form:
(5) |
One of the most common model training methods for supervised dependency parser is Maximum conditional likelihood estimation. For a supervised dependency parser with a set of training data , the logarithm of the likelihood (a.k.a. the log-likelihood) is given by:
(6) |
Maximum likelihood training chooses parameters such that the log-likelihood is maximized.
However, in our scenario we have no labeled training data for target languages but we have some parallel and unlabeled data plus an English dependency parser. For the purpose of transferring cross-lingual information from the English parser via parallel text, we explore the model training method proposed by Smith and Eisner [44], which presented a generalization of function [1], and related it to another semi-supervised learning technique, entropy regularization [20, 28]. The objective function to be minimized is actually the expected negative log-likelihood:
(7) | |||||
where and . is the “transferring distribution” that reflects our uncertainty about the true labels, and we are trying to learn a parametric model by minimizing the function.
In our scenario, we have a set of aligned parallel data where is the word alignment for the pair of source-target sentences , and a set of unlabeled sentences of the target language . We also have a trained English parsing model . Then the in equation (7) can be divided into two cases, according to whether belongs to parallel data set or unlabeled data set . For the unlabeled examples , some previous studies (e.g., [1]) simply use a uniform distribution over labels (e.g., parses), to reflect that the label is unknown. We follow the method in Smith and Eisner [44] and take the transferring distribution to be the actual current belief . The total contribution of the unsupervised examples to then simplifies to , which may be regarded as the entropy item used to constrain the model’s uncertainty to be low, as presented in the work on entropy regularization [20, 28].
But how can we define the transferring distribution for the parallel examples ? We define the transferring distribution by defining the transferring weight utilizing the English parsing model via parallel data with word alignments:
(8) |
where is the weight function of the English parsing model , and is the delexicalized form22The delexicalized form of an edge is an edge for which only delexicalized features are considered. of the edge . From the definition of the transferring weight, we can see that, if an edge of the target language sentence is aligned to an edge of the English sentence , we transfer the weight of edge to the corresponding weight of edge in the English parsing model . If the edge is not aligned to any edges of the English sentence , we reduce the edge to the delexicalized form and calculate the transferring weight in the English parsing model. There are two advantages for this definition of the transferring weight. First, by transferring the weight function to the corresponding weight in the well-developed English parsing model, we can project syntactic information across language boundaries. Second, McDonald et al. [34] demonstrates that parsers with only delexicalized features produce considerably high parsing performance. By reducing unaligned edges to their delexicalized forms, we can still use those delexicalized features, such as part-of-speech tags, for those unaligned edges, and can address problem that automatically generated word alignments include errors.
From the definition of transferring weight in equation (8), the transferring distribution can be defined in the following way:
(9) |
where
(10) |
Due to the normalizing factor , the transferring distribution is a valid one.
We introduce a multiplier as a trade-off between the two contributions (parallel and unsupervised) of the objective function , and the final objective function has the following form:
(11) | |||||
and are the contributions of the parallel and unsupervised data, respectively. One may regard as a Lagrange multiplier that is used to constrain the parser’s uncertainty H to be low, as presented in several studies on entropy regularization [5, 17, 20].
To train our parsing model, we need to find out the parameters that minimize the objective function in equation (11). This optimization problem is typically solved using quasi-Newton numerical methods such as L-BFGS [36], which requires efficient calculation of the objective function and the gradient of the objective function.
The first item () of the function in equation (11) can be rewritten in the following form:
(12) | |||||
and according to equation (1) and (3) the gradient of can be written as:
(13) | |||||
According to equation (9), can also be factored into the multiplication of the weight of each edge, so both and its gradient can be calculated by running the inside-outside algorithm [2, 41] for projective parsing. For non-projective parsing, the analogy to the inside algorithm is the matrix-tree algorithm based on Kirchhoff’s Matrix-Tree Theorem, which is dominated asymptotically by a matrix determinant [25, 46]. The gradient of a determinant may be computed by matrix inversion, so evaluating the gradient again has the same complexity as evaluating the function.
The second item () of the function in equation (11) is the Shannon entropy of the posterior distribution over parsing trees, and can be written into the following form:
(14) | |||||
and the gradient of is in the following:
(15) | |||||
Similar with the calculation of , can also be computed by running the inside-outside algorithm [2, 41] for projective parsing. For the gradient of , both the two multipliers of the second item in equation (15) can be computed using the same inside-outside algorithm. For the first item in equation (15), an dynamic programming algorithm that is closely related to the forward-backward algorithm [28] for the entropy regularized CRF [20] can be used for projective parsing. For non-projective parsing, however, the runtime rises to . In this paper, we focus on projective parsing.
To summarize the description in the previous sections, our approach is performed in the following steps:
Train an English parsing model , which is used to estimate the transferring distribution .
Prepare parallel text by running word alignment method to obtain word alignments,33The word alignment methods do not require additional resources besides parallel text. and prepare the unlabeled data.
Train a parsing model for the target language by minimizing the objective function which is the combination of expected negative log-likelihood on parallel and unlabeled data.
#sents/#tokens | |||
training | dev | test | |
Version 1.0 | |||
de | 2,200/30,460 | 800/12,215 | 1,000/16,339 |
es | 3,345/94,232 | 370/10,191 | 300/8,295 |
fr | 3,312/74,979 | 366/8,071 | 300/6,950 |
ko | 5,308/62,378 | 588/6,545 | 298/2,917 |
sv | 4,447/66,631 | 493/9,312 | 1,219/20,376 |
Version 2.0 | |||
de | 14,118/26,4906 | 800/12,215 | 1,000/16,339 |
es | 14,138/37,5180 | 1,569/40,950 | 300/8,295 |
fr | 14,511/35,1233 | 1,611/38,328 | 300/6,950 |
it | 6,389/14,9145 | 400/9,541 | 400/9,187 |
ko | 5437/60,621 | 603/6,438 | 299/2,631 |
pt | 9,600/23,9012 | 1,200/29,873 | 1,198/29,438 |
sv | 4,447/66,631 | 493/9,312 | 1,219/20,376 |
In this section, we illustrate the data sets used in our experiments and the tools for data preparation.
# sents | ||||||
---|---|---|---|---|---|---|
500 | 1000 | 2000 | 5000 | 10000 | 20000 | |
da | 12,568 | 25,225 | 49,889 | 126,623 | 254,565 | 509,480 |
de | 13,548 | 26,663 | 53,170 | 133,596 | 265,589 | 527,407 |
el | 14,198 | 28,302 | 56,744 | 143,753 | 286,126 | 572,777 |
es | 15,147 | 29,214 | 57,526 | 144,621 | 290,517 | 579,164 |
fr | 15,046 | 29,982 | 60,569 | 153,874 | 306,332 | 609,541 |
it | 15,151 | 29,786 | 57,696 | 145,717 | 288,337 | 573,557 |
ko | 3,814 | 7,679 | 15,337 | 38,535 | 77,388 | 155,051 |
nl | 13,234 | 26,777 | 54,570 | 137,277 | 274,692 | 551,463 |
pt | 14,346 | 28,109 | 55,998 | 143,221 | 285,590 | 571,109 |
sv | 12,242 | 24,897 | 50,047 | 123,069 | 246,619 | 490,086 |
Our experiments rely on two kinds of data sets: (i) Monolingual Treebanks with consistent annotation schema — English treebank is used to train the English parsing model, and the Treebanks for target languages are used to evaluate the parsing performance of our approach. (ii) Large amounts of parallel text with English on one side. We select target languages based on the availability of these resources. The monolingual treebanks in our experiments are from the Google Universal Dependency Treebanks [31], for the reason that the treebanks of different languages in Google Universal Dependency Treebanks have consistent syntactic representations.
The parallel data come from the Europarl corpus version 7 [23] and Kaist Corpus44http://semanticweb.kaist.ac.kr/home/index.php/Corpus10. Taking the intersection of languages in the two kinds of resources yields the following seven languages: French, German, Italian, Korean, Portuguese, Spanish and Swedish.
The treebanks from CoNLL shared-tasks on dependency parsing [6, 38] appear to be another reasonable choice. However, previous studies [34, 31] have demonstrated that a homogeneous representation is critical for multilingual language technologies that require consistent cross-lingual analysis for downstream components, and the heterogenous representations used in CoNLL shared-tasks treebanks weaken any conclusion that can be drawn.
For comparison with previous studies, nevertheless, we also run experiments on CoNLL treebanks (see Section 4.4 for more details). We evaluate our approach on three target languages from CoNLL shared task treebanks, which do not appear in Google Universal Treebanks. The three languages are Danish, Dutch and Greek. So totally we have ten target languages. The parallel data for these three languages are also from the Europarl corpus version 7.
In our approach, word alignments for the parallel text are required. We perform word alignments with the open source GIZA++ toolkit55https://code.google.com/p/giza-pp/. The parallel corpus was preprocessed in standard ways, selecting sentences with the length in the range from 3 to 100. Then we run GIZA++ with the default setting to generate word alignments in both directions. We then make the intersection of the word alignments of two directions to generate one-to-one alignments.
DTP | DTP† | PTP† | -U | +U | OR | |
---|---|---|---|---|---|---|
de | 58.50 | 58.46 | 69.21 | 73.72 | 74.01 | 78.64 |
es | 68.07 | 68.72 | 72.57 | 75.32 | 75.60 | 82.56 |
fr | 70.14 | 71.13 | 74.60 | 76.65 | 76.93 | 83.69 |
ko | 42.37 | 43.57 | 53.72 | 59.72 | 59.94 | 89.85 |
sv | 70.56 | 70.59 | 75.87 | 78.91 | 79.27 | 85.59 |
Ave | 61.93 | 62.49 | 69.19 | 72.86 | 73.15 | 84.67 |
Several features in our parsing model involve part-of-speech (POS) tags of the input sentences. The set of POS tags needs to be consistent across languages and treebanks. For this reason we use the universal POS tag set of Petrov et al. [42]. This set consists of the following 12 coarse-grained tags: NOUN (nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), PRON (pronouns), DET (determiners), ADP (prepositions or postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), PUNC (punctuation marks) and X (a catch-all for other categories such as abbreviations or foreign words).
POS tags are not available for parallel data in the Europarl and Kaist corpus, so we need to provide the POS tags for these data. In our experiments, we train a Stanford POS Tagger [50] for each language. The labeled training data for each POS tagger are extracted from the training portion of each Treebanks. The average tagging accuracy is around 95%.
Undoubtedly, we are primarily interested in applying our approach to build statistical parsers for resource-poor target languages without any knowledge. For the purpose of evaluation of our approach and comparison with previous work, we need to exploit the gold POS tags to train the POS taggers. As part-of-speech tags are also a form of syntactic analysis, this assumption weakens the applicability of our approach. Fortunately, some recently proposed POS taggers, such as the POS tagger of Das and Petrov [13], rely only on labeled training data for English and the same kind of parallel text in our approach. In practice we can use this kind of POS taggers to predict POS tags, whose tagging accuracy is around 85%.
DTP† | PTP† | -U | +U | OR | |
---|---|---|---|---|---|
de | 58.56 | 69.77 | 73.92 | 74.30 | 81.65 |
es | 68.72 | 73.22 | 75.21 | 75.53 | 83.92 |
fr | 71.13 | 74.75 | 76.14 | 76.53 | 83.51 |
it | 70.74 | 76.08 | 77.55 | 77.74 | 85.47 |
ko | 38.55 | 43.34 | 59.71 | 59.89 | 90.42 |
pt | 69.82 | 74.59 | 76.30 | 76.65 | 85.67 |
sv | 70.59 | 75.87 | 78.91 | 79.27 | 85.59 |
Ave | 64.02 | 69.66 | 73.96 | 74.27 | 85.18 |
In this section, we will describe the details of our experiments and compare our results with previous methods.
Google Universal Treebanks V1.0 | |||||||||||||||
de | es | fr | ko | sv | |||||||||||
# sents | PTP† | -U | +U | PTP† | -U | +U | PTP† | -U | +U | PTP† | -U | +U | PTP† | -U | +U |
500 | 63.23 | 70.79 | 70.93 | 70.09 | 72.32 | 72.64 | 72.24 | 74.64 | 74.90 | 47.71 | 56.87 | 57.22 | 71.70 | 75.88 | 76.13 |
1000 | 65.61 | 71.71 | 71.86 | 70.90 | 73.44 | 73.67 | 72.95 | 75.07 | 75.35 | 47.83 | 57.65 | 58.15 | 72.38 | 76.55 | 77.03 |
2000 | 66.52 | 72.33 | 72.48 | 72.01 | 73.57 | 73.81 | 73.69 | 75.88 | 76.22 | 48.37 | 58.19 | 58.44 | 73.65 | 77.86 | 78.12 |
5000 | 67.79 | 73.06 | 73.31 | 72.34 | 74.30 | 74.79 | 74.31 | 76.02 | 76.29 | 53.02 | 58.57 | 59.04 | 74.88 | 78.48 | 78.70 |
10000 | 68.44 | 73.59 | 73.92 | 72.48 | 74.86 | 75.26 | 74.43 | 76.14 | 76.34 | 53.61 | 59.17 | 59.55 | 75.34 | 78.78 | 79.08 |
20000 | 69.21 | 73.72 | 74.01 | 72.57 | 75.32 | 75.60 | 74.60 | 76.55 | 76.93 | 53.72 | 59.72 | 59.94 | 75.87 | 78.91 | 79.27 |
Google Universal Treebanks V2.0 | |||||||||||||||
de | es | fr | ko | it | |||||||||||
# sents | PTP† | -U | +U | PTP† | -U | +U | PTP† | -U | +U | PTP† | -U | +U | PTP† | -U | +U |
500 | 60.10 | 71.07 | 71.39 | 69.52 | 72.97 | 73.28 | 71.10 | 74.57 | 74.70 | 40.09 | 56.60 | 57.10 | 72.80 | 75.67 | 75.94 |
1000 | 61.76 | 72.15 | 72.39 | 70.78 | 73.48 | 73.79 | 72.14 | 75.13 | 75.43 | 40.44 | 57.55 | 57.93 | 73.55 | 76.43 | 76.67 |
2000 | 65.35 | 72.73 | 73.04 | 71.75 | 74.10 | 74.35 | 73.21 | 75.78 | 76.06 | 40.87 | 58.11 | 58.43 | 74.44 | 76.99 | 77.39 |
5000 | 67.86 | 73.32 | 73.62 | 72.43 | 74.55 | 74.83 | 74.14 | 75.83 | 76.02 | 40.90 | 58.48 | 58.96 | 75.07 | 77.10 | 77.34 |
10000 | 68.70 | 73.71 | 74.02 | 72.85 | 74.80 | 74.95 | 74.53 | 75.97 | 76.17 | 41.29 | 59.13 | 59.44 | 75.65 | 77.50 | 77.71 |
20000 | 69.77 | 73.92 | 74.30 | 73.22 | 75.21 | 75.53 | 74.75 | 76.14 | 76.53 | 43.34 | 59.71 | 59.89 | 76.08 | 77.55 | 77.74 |
pt | |||||||||||||||
# sents | PTP† | -U | +U | ||||||||||||
500 | 71.34 | 74.41 | 74.68 | ||||||||||||
1000 | 71.91 | 74.48 | 75.08 | ||||||||||||
2000 | 72.93 | 75.10 | 75.32 | ||||||||||||
5000 | 73.78 | 75.88 | 75.98 | ||||||||||||
10000 | 74.40 | 75.99 | 76.15 | ||||||||||||
20000 | 74.59 | 76.30 | 76.65 |
As presented in Section 3.1, we evaluate our parsing approach on both version 1.0 and version 2.0 of Google Univereal Treebanks for seven languages66Japanese and Indonesia are excluded as no practicable parallel data are available.. We use the standard splits of the treebank for each language as specified in the release of the data77https://code.google.com/p/uni-dep-tb/. Table 1 presents the statistics of the two versions of Google Universal Treebanks. We strip all the dependency annotations off the training portion of each treebank, and use that as the unlabeled data for that target language. We train our parsing model with different numbers of parallel sentences to analyze the influence of the amount of parallel data on the parsing performance of our approach. The parallel data sets contain 500, 1000, 2000, 5000, 10000 and 20000 parallel sentences, respectively. We randomly extract parallel sentences from each corpora, and smaller data sets are subsets of larger ones. Table 2 shows the number of tokens in the parallel data used in the experiments.
For the comparison of parsing performance, we run experiments on the following systems:
The direct transfer parser (DTP) proposed by McDonald et al. [34], who train a delexicalized parser on English labeled training data with no lexical features, then apply this parser to parse target languages directly. It is based on the transition-based dependency parsing paradigm [40]. We directly cite the results reported in McDonald et al. [31]. In addition to their original results, we also report results by re-implementing the direct transfer parser based on the first-order projective dependency parsing model [30] (DTP†).
The projected transfer parser (PTP) described in McDonald et al. [34]. The results of the projected transfer parser re-implemented by us is marked as “PTP†”.
Our approach training on only parallel data without unlabeled data for the target language. The parallel data set for each language contains 20,000 sentences.
Our approach training on both parallel and unlabeled data. The parallel data sets are the ones contains 20,000 sentences.
Parsing accuracy is measured with unlabeled attachment score (UAS): the percentage of words with the correct head.
Table 3 and Table 4 shows the parsing results of our approach, together with the results of the baseline systems and the oracle, on version 1.0 and version 2.0 of Google Universal Treebanks, respectively. Our approaches significantly outperform all the baseline systems across all the seven target languages. For the results on Google Universal Treebanks version 1.0, the improvement on average over the projected transfer paper (PTP†) is 3.96% and up to 6.22% for Korean and 4.80% for German. For the other three languages, the improvements are remarkable, too — 2.33% for French, 3.03% for Spanish and 3.40% for Swedish. By adding entropy regularization from unlabeled data, our full model achieves average improvement of 0.29% over the “-U” setting. Moreover, our approach considerably bridges the gap to fully supervised dependency parsers, whose average UAS is 84.67%. For the results on treebanks version 2.0, we can get similar observation and draw the same conclusion.
Table 5 illustrates the UAS of our approach trained on different amounts of parallel data, together with the results of the projected transfer parser re-implemented by us (PTP†). We run two versions of our approach for each of the parallel data sets, one with unlabeled data (+U) and the other without them (-U). From table 5 we can get three observations. First, even the parsers trained with only 500 parallel sentences achieve considerably high parsing accuracies (average 70.10% for version 1.0 and 71.59% for version 2.0). This demonstrates that our approach does not rely on a large amount of parallel data. Second, when gradually increasing the amount of parallel data, the parsing performance continues improving. Third, entropy regularization with unlabeled data makes modest improvement on parsing performance over the parsers without unlabeled data. This proves the effectiveness of the entropy regularization from unlabeled data.
To make a thorough empirical comparison with previous studies, we also evaluate our system without unlabeled data (-U) on treebanks from CoNLL shared task on dependency parsing [6, 38]. To facilitate comparison, we use the same eight Indo-European languages as target languages: Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish, and same experimental setup as McDonald et al. [34]. We report both the results of the direct transfer and projected transfer parsers directly cited from McDonald et al. [34] (DTP and PTP) and re-implemented by us (DTP†and PTP†).
Table 6 gives the results comparing the model without unlabeled data (-U) presented in this work to those five baseline systems and the oracle (OR). The results of unsupervised DMV model [22] are from Table 1 of McDonald et al. [34]. Our approach outperforms all these baseline systems and achieves state-of-the-art performance on all the eight languages.
In order to compare with more previous methods, we also report parsing performance on sentences of length 10 or less after punctuation has been removed. Table 7 shows the results of our system and the results of baseline systems. “USR†” is the weakly supervised system of Naseem et al. [35]. “PGI” is the phylogenetic grammar induction model of Berg-Kirkpatrick and Klein [3]. Both the results of the two systems are cited from Table 4 of McDonald et al. [34]. We also include the results of the unsupervised dependency parsing model with non-parallel multilingual guidance (NMG) proposed by Cohen et al. [11]88For each language, we use the best result of the four systems in Table 3 of Cohen et al. [11], and “PR” which is the posterior regularization approach presented in Gillenwater et al. [15]. All the results are shown in Table 7.
From Table 7, we can see that among the eight target languages, our approach achieves best parsing performance on six languages — Danish, German, Greek, Italian, Portuguese and Swedish. It should be noted that the “NMG” system utilizes more than one helper languages. So it is not directly comparable to our work.
DMV | DTP | DTP† | PTP | PTP† | -U | OR | |
---|---|---|---|---|---|---|---|
da | 33.4 | 45.9 | 46.8 | 48.2 | 50.0 | 50.1 | 87.1 |
de | 18.0 | 47.2 | 46.0 | 50.9 | 52.4 | 57.3 | 87.0 |
el | 39.9 | 63.9 | 62.9 | 66.8 | 65.3 | 67.4 | 82.3 |
es | 28.5 | 53.3 | 54.4 | 55.8 | 59.9 | 60.3 | 83.6 |
it | 43.1 | 57.7 | 59.9 | 60.8 | 63.4 | 64.0 | 83.9 |
nl | 38.5 | 60.8 | 60.7 | 67.8 | 66.5 | 68.2 | 78.2 |
pt | 20.1 | 69.2 | 71.1 | 71.3 | 74.8 | 75.1 | 87.2 |
sv | 44.0 | 58.3 | 60.3 | 61.3 | 62.8 | 66.7 | 88.0 |
Ave | 33.2 | 57.0 | 57.8 | 60.4 | 61.9 | 63.6 | 84.7 |
DTP | DTP† | PTP | PTP† | USR† | PGI | PR | NMG | -U | |
---|---|---|---|---|---|---|---|---|---|
da | 53.2 | 55.3 | 57.4 | 59.8 | 55.1 | 41.6 | 44.0 | 59.9 | 60.1 |
de | 65.9 | 57.9 | 67.0 | 63.5 | 60.0 | — | — | — | 67.5 |
el | 73.9 | 70.8 | 73.9 | 72.3 | 60.3 | — | — | 73.0 | 74.3 |
es | 58.0 | 62.3 | 62.3 | 66.1 | 68.3 | 58.4 | 62.4 | 76.7 | 64.6 |
it | 65.5 | 66.9 | 69.9 | 71.5 | 47.9 | — | — | — | 73.6 |
nl | 67.6 | 66.0 | 72.2 | 72.1 | 44.0 | 45.1 | 37.9 | 50.7 | 70.5 |
pt | 77.9 | 79.2 | 80.6 | 82.9 | 70.9 | 63.0 | 47.8 | 79.8 | 83.3 |
sv | 70.4 | 70.2 | 71.3 | 70.4 | 52.6 | 58.3 | 42.2 | 74.0 | 75.1 |
Ave | 66.6 | 66.1 | 69.4 | 69.8 | 57.4 | — | — | — | 71.1 |
In this section, we briefly outline a few extensions to our approach that we want to explore in future work.
As mentioned in section 2.3, the runtime to compute and its gradient is . One reasonable speedup, as presented in Smith and Eisner [44], is to replace Shannon entropy with Rényi entropy. The Rényi entropy is parameterized by :
(16) |
With Rényi entropy, the computation of and its gradient is , even for non-projective case.
Our learning framework can be extended to higher-order dependency parsing models. For example, if we want to make our model capable of utilizing more contextual information, we can extend our transferring weight to higher-order parts:
(17) |
where is a small part of tree that has limited interactions. For projective parsing, several algorithms [33, 8, 24, 27] have been proposed to solve the model training problems (calculation of objective function and gradient) for different factorizations.
One possible direction to improve our approach is to replace parallel text with Interlinear Glossed Text (IGT) [26], which is a semi-structured data type encoding more syntactic information than parallel data. By using IGT Data, not only can we obtain more accurate word alignments, but also extract useful cross-lingual information for the resource-poor language.
In this paper, we propose an unsupervised projective dependency parsing approach for resource-poor languages, using existing resources from a resource-rich source language. By presenting a model training framework, our approach can utilize parallel text to estimate transferring distribution with the help of a well-developed resource-rich language dependency parser, and use unlabeled data as entropy regularization. The experimental results on three data sets across ten target languages show that our approach achieves significant improvement over previous studies.
This material is based upon work supported by the National Science Foundation under Grant No. BCS-0748919. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.