Large-scale discriminative training has become promising for statistical machine translation by leveraging the huge training corpus; for example, the recent effort in phrase-based MT [21] significantly outperforms mainstream methods that only train on small tuning sets. However, phrase-based MT suffers from limited reordering, and thus its training can only utilize a small portion of the bitext due to the distortion limit. To address this problem, we extend \newciteyu+:2013 to syntax-based MT by generalizing their latent variable "violation-fixing" perceptron from graphs to hypergraphs. Experiments confirm that our method leads to up to +1.2 Bleu improvement over mainstream methods such as Mert and Pro.
Many natural language processing problems including part-of-speech tagging [7], parsing [18], and event extraction [16] have enjoyed great success using large-scale discriminative training algorithms. However, a similar success on machine translation has been elusive, where the mainstream methods still tune on small datasets.
What makes large-scale MT training so hard then? After numerous attempts by various researchers [17, 20, 1, 2, 5, 9, 10], the recent work of \newciteyu+:2013 finally reveals a major reason: it is the vast amount of (inevitable) search errors in MT decoding that leads learning astray. To alleviate this problem, their work adopts the theoretically-motivated framework of violation-fixing perceptron [12], which is tailored for inexact search, yielding great results on phrase-based MT (outperforming small-scale Mert/Pro by a large margin for the first time). However, the underlying phrase-based model suffers from limited distortion and thus can only employ a small portion (about 1/3 in their Ch-En experiments) of the bitext in training.
[Figure 1: Relationship to previous work. Collins (02), Huang et al. (12), and Yu et al. (13) operate on graphs, while Zhang et al. (13) and this work generalize to hypergraphs; Yu et al. (13) and this work further handle latent variables.]
[Figure 2: Example sentence pair "Bùshí yǔ Shālóng jǔxíng huìtán" / "Bush held talks with Sharon" (literally "Bush with Sharon held talks"): (a) Hiero rules; (b) gold derivation; (c) Viterbi derivation.]
To better utilize the large training set, we propose to generalize from phrase-based MT to syntax-based MT, in particular the hierarchical phrase-based translation model (Hiero) [6], in order to exploit sentence pairs beyond the expressive capacity of phrase-based MT.
The key challenge here is to extend the latent variable violation-fixing perceptron of \newciteyu+:2013 to handle tree-structured derivations and translation hypergraphs. Luckily, \newcitezhang+:2013 have recently generalized the underlying violation-fixing perceptron of \newcitehuang+:2012 from graphs to hypergraphs for bottom-up parsing, which resembles syntax-based decoding. We just need to further extend it to handle latent variables. We make the following contributions:
We generalize the latent variable violation-fixing perceptron framework to inexact search over hypergraphs, which subsumes previous algorithms for PBMT and bottom-up parsing as special cases (see Fig. 1).
We show that syntax-based MT, with its better handling of long-distance reordering, can exploit a larger portion of the training set, which facilitates sparse lexicalized features.
Experiments show that our training algorithm outperforms mainstream tuning methods (which optimize on small devsets) by up to +1.2 Bleu over Mert and Pro on FBIS.
For clarity reasons we will describe Hiero decoding as a two-pass process: first without a language model, and then with the LM integrated. This section mostly follows \newcitehuang+chiang:2007.
In the first phase, the decoder parses the source sentence using the source projection of the synchronous grammar (see Fig. 2 (a) for an example), producing a hypergraph where each node has a signature $N_{[i:j]}$, with $N$ the nonterminal type (either X or S in Hiero) and $[i:j]$ the source span, and where each hyperedge is an application of a translation rule (see Figure 3).
To incorporate the language model, each node also needs to remember its target-side boundary words. Thus a node is split into multiple nodes of signature $N^{\,a\,\star\,b}_{[i:j]}$, where $a$ and $b$ are the boundary words. For example, with a bigram LM, $\mathrm{X}^{\,\textit{held}\,\star\,\textit{Sharon}}_{[1:5]}$ is a node whose translation starts with "held" and ends with "Sharon".
More formally, the whole decoding process can be cast as a deductive system. Take the partial translation "held talks with Sharon" in Figure 2 (b) for example; with a bigram LM, the deduction is
$$\frac{(\mathrm{X}^{\,\textit{held}\,\star\,\textit{talks}}_{[3:5]}):\ s_1 \qquad (\mathrm{X}^{\,\textit{with}\,\star\,\textit{Sharon}}_{[1:3]}):\ s_2}{(\mathrm{X}^{\,\textit{held}\,\star\,\textit{Sharon}}_{[1:5]}):\ s_1 + s_2 + s(r) + \lambda}\ r$$
where $s(r)$ is the score of the rule $r$ being applied, and the LM combo score is $\lambda = \log P_{\textrm{lm}}(\textit{with} \mid \textit{talks})$, covering the new bigram created at the boundary between the two partial translations.
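To make the +LM combination concrete, here is a minimal Python sketch of how two +LM items could be combined under a bigram LM; the `LMItem` structure and the `combine` and `toy_lm` names are ours for illustration and do not correspond to cdec's internal API.

```python
import math
from collections import namedtuple

# A +LM item: nonterminal, source span, target boundary words, partial translation, score.
# This structure is illustrative; it does not mirror cdec's internal data structures.
LMItem = namedtuple("LMItem", "nt span left_word right_word words score")

def combine(left, right, new_span, rule_score, lm_logprob):
    """Combine two +LM items whose target strings are concatenated as left + right.

    `new_span` is the source span covered by the rule application (the target
    order may differ from the source order because of reordering).  Under a
    bigram LM the combo score covers only the new bigram crossing the boundary
    between the two partial translations; bigrams internal to each item were
    already scored when those items were built.
    """
    combo = lm_logprob(right.left_word, given=left.right_word)
    return LMItem("X", new_span, left.left_word, right.right_word,
                  left.words + right.words,
                  left.score + right.score + rule_score + combo)

def toy_lm(word, given):
    """Stub bigram LM; a real system would query a trained n-gram model."""
    return math.log(0.1)

# Toy usage mirroring the deduction above:
held_talks  = LMItem("X", (3, 5), "held", "talks",  ["held", "talks"],  -1.2)
with_sharon = LMItem("X", (1, 3), "with", "Sharon", ["with", "Sharon"], -1.5)
item = combine(held_talks, with_sharon, new_span=(1, 5),
               rule_score=-0.3, lm_logprob=toy_lm)
# item now has boundary words ("held", "Sharon") and covers source span [1:5].
```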
As mentioned in Section 1, the key to the success of \newciteyu+:2013 is the adoption of the violation-fixing perceptron of \newcitehuang+:2012, which is tailored for vastly inexact search. The general idea is to update somewhere in the middle of the search (where a search error happens) rather than at the very end (where the standard update is often invalid). To adapt it to MT, where many derivations can output the same translation (i.e., spurious ambiguity), \newciteyu+:2013 extend it to handle latent variables, which correspond to phrase-based derivations. On the other hand, \newcitezhang+:2013 have generalized \newcitehuang+:2012 from graphs to hypergraphs for bottom-up parsing, which resembles Hiero decoding. So we just need to combine the two directions of generalization (latent variables and hypergraphs; see Fig. 1).
The key difference between bottom-up parsing and MT decoding is that in parsing the gold tree for each input sentence is unique, while in MT many derivations can generate the same reference translation. In other words, the gold derivation to update towards is a latent variable.
Here we formally define the latent variable "max-violation" perceptron over a hypergraph for MT training. For a given sentence pair $\langle x, y \rangle$, we denote $H(x)$ as the decoding hypergraph of Hiero without any pruning. We say $d \in H(x)$ if $d$ is a full derivation of decoding $x$ and $d$ can be derived from the hypergraph. Let $\mathit{good}(x, y)$ be the set of $y$-good derivations for $x$:
$$\mathit{good}(x, y) \overset{\Delta}{=} \{\, d \in H(x) \mid e(d) = y \,\},$$
where $e(d)$ is the translation from derivation $d$. We then define the set of $y$-good partial derivations that cover the source span $[i:j]$ with root $N^{\,a\,\star\,b}_{[i:j]}$ as
$$\mathit{good}_{N^{a\star b}_{[i:j]}}(x, y) \overset{\Delta}{=} \{\, d \mid \mathit{root}(d) = N^{\,a\,\star\,b}_{[i:j]},\ d \text{ is a subderivation of some } d' \in \mathit{good}(x, y) \,\}.$$
We further denote the real decoding hypergraph with beam pruning and cube pruning as $H'(x)$. The set of $y$-bad partial derivations with root $N^{\,a\,\star\,b}_{[i:j]}$ is defined as
$$\mathit{bad}_{N^{a\star b}_{[i:j]}}(x, y) \overset{\Delta}{=} \{\, d \mid \mathit{root}(d) = N^{\,a\,\star\,b}_{[i:j]},\ d \in H'(x),\ d \notin \mathit{good}_{N^{a\star b}_{[i:j]}}(x, y) \,\}.$$
Note that the $y$-good derivations are defined over the unpruned whole decoding hypergraph, while the $y$-bad derivations are defined over the real decoding hypergraph with pruning.
The max-violation method performs the update where the model score difference between the incorrect Viterbi partial derivation and the best $y$-good partial derivation is maximal, by penalizing the incorrect Viterbi partial derivation and rewarding the $y$-good partial derivation.
More formally, we first find the Viterbi partial derivation $d^{-}_{v}$ and the best $y$-good partial derivation $d^{+}_{v}$ for each group (node) $v$ in the pruned hypergraph:
$$d^{+}_{v} \overset{\Delta}{=} \operatorname*{argmax}_{d \in \mathit{good}_{v}(x, y)} \mathbf{w} \cdot \boldsymbol{\Phi}(x, d), \qquad d^{-}_{v} \overset{\Delta}{=} \operatorname*{argmax}_{d \in \mathit{bad}_{v}(x, y)} \mathbf{w} \cdot \boldsymbol{\Phi}(x, d),$$
where $\boldsymbol{\Phi}(x, d)$ is the feature vector for derivation $d$. Then it finds the group $v^{*}$ with the maximal score difference between the Viterbi derivation and the best $y$-good derivation:
$$v^{*} \overset{\Delta}{=} \operatorname*{argmax}_{v}\ \mathbf{w} \cdot \Delta\boldsymbol{\Phi}(x, d^{-}_{v}, d^{+}_{v}),$$
and updates as follows:
$$\mathbf{w} \leftarrow \mathbf{w} + \Delta\boldsymbol{\Phi}(x, d^{+}_{v^{*}}, d^{-}_{v^{*}}),$$
where $\Delta\boldsymbol{\Phi}(x, d, d') \overset{\Delta}{=} \boldsymbol{\Phi}(x, d) - \boldsymbol{\Phi}(x, d')$.
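A minimal sketch of this update, assuming we have already extracted, for every node group $v$ in the pruned hypergraph, the sparse feature vectors of its best $y$-good partial derivation and of its Viterbi $y$-bad partial derivation; all function and variable names here are illustrative.

```python
def dot(w, feats):
    """Sparse dot product between a weight dict and a feature-count dict."""
    return sum(w.get(f, 0.0) * c for f, c in feats.items())

def max_violation_update(w, groups):
    """One latent-variable max-violation perceptron update.

    `groups` maps each node group v (a +LM signature) to a pair
    (good_feats, bad_feats): feature vectors of the best y-good partial
    derivation and of the Viterbi (y-bad) partial derivation rooted at v.
    We pick the group where the bad derivation beats the good one by the
    largest margin, then reward the good features and penalize the bad ones.
    """
    best_v, best_violation = None, 0.0
    for v, (good_feats, bad_feats) in groups.items():
        violation = dot(w, bad_feats) - dot(w, good_feats)
        if violation > best_violation:
            best_v, best_violation = v, violation
    if best_v is None:
        return w                     # no violation anywhere: nothing to fix
    good_feats, bad_feats = groups[best_v]
    for f, c in good_feats.items():
        w[f] = w.get(f, 0.0) + c
    for f, c in bad_feats.items():
        w[f] = w.get(f, 0.0) - c
    return w
```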
We now describe how to find the gold derivations (we only consider a single reference in this paper). Such derivations can be generated in a way similar to \newciteyu+:2013, by using a language model tailored for forced decoding:
$$P_{\textrm{forced}}(e_j \mid e_i) \overset{\Delta}{=} \begin{cases} 1 & \text{if } j = i + 1,\\ 0 & \text{otherwise,}\end{cases}$$
where $i$ and $j$ are the indices of the boundary words in the reference translation. A node now has signature $N^{\,i\,\star\,j}_{[k:l]}$, where $i$ and $j$ are the indices of its boundary words. If a boundary word does not occur in the reference, its index is set to $\infty$ so that its language model score will always be $-\infty$; if a boundary word occurs more than once in the reference, its node is split into multiple nodes, one for each such index. (Our formulation of the index-based language model fixes a bug in the word-based LM of \newciteyu+:2013 when a substring appears more than once in the reference, e.g. "the man…the man…"; thanks to Dan Gildea for pointing it out.)
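A sketch of the index-based forced-decoding LM described above; we use `None` where the text uses the index $\infty$, and the helper names are ours, not cdec's.

```python
from collections import defaultdict

NEG_INF = float("-inf")

def reference_positions(reference):
    """Map each reference word to the list of positions where it occurs.

    A word occurring more than once gets several positions, so the
    corresponding +LM node is split into one node per position.
    """
    positions = defaultdict(list)
    for idx, word in enumerate(reference):
        positions[word].append(idx)
    return positions

def forced_bigram_score(i, j):
    """Index-based forced-decoding 'bigram LM' in log space.

    Returns 0 (= log 1) only if index j immediately follows index i in the
    reference; otherwise -inf, so any derivation that does not reproduce the
    reference word order is ruled out.  A boundary word that never occurs in
    the reference gets index None (the text's infinite index) and always
    scores -inf.
    """
    if i is None or j is None:
        return NEG_INF
    return 0.0 if j == i + 1 else NEG_INF

# Example: the repeated-substring case that broke the word-based LM.
ref = "the man saw the man".split()
pos = reference_positions(ref)   # pos["man"] == [1, 4]: two separate node splits
```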
Following \newciteyu+:2013, we call our max-violation method MaxForce. Our implementation is mostly in Python, on top of the cdec system [8] via the pycdec interface [3]. In addition, we use the minibatch parallelization of [22] to speed up perceptron training. We evaluate MaxForce for Hiero on two Ch-En corpora, IWSLT09 and FBIS, and compare its performance with vanilla $n$-best Mert [19] from Moses [14], Hypergraph Mert [15], and Pro [11] from cdec.
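The minibatch parallelization can be sketched roughly as below: within a minibatch, all updates are computed from the same weight vector (in the real system, in parallel across cores) and then applied together. This is a schematic single-process version with illustrative names, not the actual implementation of [22].

```python
def minibatch_perceptron(examples, update_delta, w, minibatch_size=24, epochs=1):
    """Schematic minibatch perceptron training loop.

    `update_delta(w, example)` returns the sparse weight delta (a dict) that a
    single max-violation update would make for one example under weights w.
    Deltas within a minibatch are computed from the same w -- this is the part
    that can run in parallel -- and then summed into w.
    """
    for _ in range(epochs):
        for start in range(0, len(examples), minibatch_size):
            batch = examples[start:start + minibatch_size]
            deltas = [update_delta(w, ex) for ex in batch]   # parallelizable
            for delta in deltas:
                for f, val in delta.items():
                    w[f] = w.get(f, 0.0) + val
    return w
```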
We use all 18 dense features from cdec, including the language model, the direct translation probability $P(e \mid f)$, the lexical translation probabilities $P_{\textrm{lex}}(e \mid f)$ and $P_{\textrm{lex}}(f \mid e)$, the length penalty, counts for the source and target sides in the training corpus, and flags for the glue rules and pass-through rules.
For sparse features we use Word-Edges features [4, 13], which have been shown to be extremely effective in both parsing and phrase-based MT [21]. We find that even simple Word-Edges features boost the performance significantly, while adding the more complex Word-Edges features from \newciteyu+:2013 brings limited improvement and slows down decoding. So in the following experiments we only use Word-Edges features consisting of combinations of English and Chinese words and Chinese characters, and do not use word clusters or word types. For simplicity and efficiency reasons, we also exclude all non-local features.
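As an illustration of what simple Word-Edges features look like here, the sketch below pairs the English boundary words of a rule application with the Chinese words (and characters) at the edges of its source span; the exact templates used in our experiments may differ, and the names are ours.

```python
def word_edge_features(src_words, i, j, tgt_left, tgt_right):
    """Emit simple Word-Edges features for a rule application over source span [i:j]
    whose translation has boundary words tgt_left ... tgt_right.

    Templates (illustrative): each English boundary word is paired with the
    Chinese word at the corresponding edge of the span, and with that word's
    first/last character.
    """
    feats = []
    src_left, src_right = src_words[i], src_words[j - 1]
    feats.append("WE_L=%s|%s" % (src_left, tgt_left))
    feats.append("WE_R=%s|%s" % (src_right, tgt_right))
    # character-level variants on the Chinese side
    feats.append("WE_Lc=%s|%s" % (src_left[0], tgt_left))
    feats.append("WE_Rc=%s|%s" % (src_right[-1], tgt_right))
    return feats
```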
Our first corpus, IWSLT09, contains 30k short sentences collected from spoken language. IWSLT04 is used as the development set in MaxForce training, and as the tuning set for $n$-best Mert, Hypergraph Mert, and Pro. IWSLT05 is used as the test set. Both IWSLT04 and IWSLT05 contain 16 references. We mainly use this corpus to investigate the properties of MaxForce.
The second corpus, FBIS, contains 240k sentences. NIST06 newswire is used as the development set for MaxForce training, and as the tuning set for all other tuning methods. NIST08 newswire is used as the test set. Both NIST06 newswire and NIST08 newswire contain 4 references. We mainly use this corpus to demonstrate the performance of MaxForce in large-scale training.
For both corpora, we do standard tokenization, alignment, and rule extraction using the cdec tools. In rule extraction, we remove all 1-count rules, but keep the rules mapping one Chinese word to one English word, to balance between overfitting and coverage. We use a trigram language model trained on the target side of each corpus.
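The 1-count filtering can be sketched as a single pass over the extracted grammar; here `rules` is assumed to be an iterable of (source tokens, target tokens, count) tuples, which is a simplification of cdec's actual grammar format, and the nonterminal check is a guess at the placeholder notation.

```python
def is_nonterminal(token):
    """Heuristic check for a nonterminal placeholder such as '[X,1]' (assumption)."""
    return token.startswith("[") and token.endswith("]")

def filter_rules(rules):
    """Drop 1-count rules, but keep rules mapping one Chinese word to one English word.

    Keeping the word-to-word singletons preserves lexical coverage (and forced
    decoding reachability) while pruning most noisy 1-count rules.
    """
    kept = []
    for src, tgt, count in rules:
        word_to_word = (len(src) == 1 and len(tgt) == 1
                        and not is_nonterminal(src[0]) and not is_nonterminal(tgt[0]))
        if count > 1 or word_to_word:
            kept.append((src, tgt, count))
    return kept
```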
Table 1: Forced decoding reachability on FBIS.

| | sent. | words |
|---|---|---|
| phrase-based MT | 32% | 12% |
| Hiero | 35% | 30% |
| Hiero (all rules) | 65% | 55% |
We first report the forced decoding reachability for Hiero on FBIS in Table 1. With the full rule set, 65% of the sentences and 55% of the words of the whole corpus are forced-decodable in Hiero. After pruning 1-count rules, our forced decoding still covers significantly more words than phrase-based MT in \newciteyu+:2013. Furthermore, in phrase-based MT most decodable sentences are very short, while in Hiero the lengths of decodable sentences are more evenly distributed.
However, in the following experiments, due to efficiency considerations, we use the "tight" rule extraction in cdec, which is stricter than the standard "loose" rule extraction and thus generates a reduced rule set and, consequently, reduced reachability. We show the reachability distributions of both tight and loose rule extraction in Figure 4.
For IWSLT, we first compare the performance of various update methods in Figure 5. The max-violation method is more than 15 Bleu points better than the standard perceptron (also known as "bold update" in \newciteliang+:2006), which updates at the root of the derivation tree. This can be explained by the fact that in training, 58% of the standard updates are invalid (i.e., they do not fix any violation). We also use the "skip" strategy of \newcitezhang+:2013, which updates at the root of the derivation only when doing so fixes a search error, avoiding all invalid updates; this achieves 10 Bleu better than the standard update, but is still more than 5 Bleu worse than max-violation update. Finally, we also try the "local update" method of \newciteliang+:2006, which updates towards the derivation with the best Bleu in the root group (the group spanning the whole sentence); this method is about 2 Bleu points worse than max-violation. (Two notes on Bleu: we find that while MaxForce generates translations with a length ratio close to 1 during training, the length ratios on the dev/test sets are significantly lower due to OOVs, so we run a binary search for the length penalty weight after each training iteration to tune the length ratio to 0.97 on the dev set; all Bleu scores are reported with averaged reference lengths.)
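The length-ratio tuning in the note above is a plain binary search over the length penalty weight; `decode_length_ratio` stands in for decoding the dev set with a given weight and measuring the hypothesis/reference length ratio, and the monotonicity assumption and search bounds are ours.

```python
def tune_length_penalty(decode_length_ratio, target=0.97,
                        lo=-5.0, hi=5.0, tol=0.005, max_iters=20):
    """Binary-search the length penalty weight so the dev-set length ratio hits `target`.

    Assumes the ratio is monotonically increasing in the weight over [lo, hi]
    (a larger weight rewards longer output).  `decode_length_ratio(w)` decodes
    the dev set with weight w and returns hypothesis length / reference length.
    """
    for _ in range(max_iters):
        mid = (lo + hi) / 2.0
        ratio = decode_length_ratio(mid)
        if abs(ratio - target) < tol:
            return mid
        if ratio < target:
            lo = mid        # output too short: increase the weight
        else:
            hi = mid        # output too long: decrease the weight
    return (lo + hi) / 2.0
```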
We further investigate the contribution of sparse features in Figure 6. On the development set, max-violation update without Word-Edges features achieves Bleu similar to $n$-best Mert and Pro, but lower than Hypergraph Mert. Adding simple Word-Edges features improves Bleu by 2 points, outperforming the very strong Hypergraph Mert baseline by 1 point. See Table 2 for details. The results of $n$-best Mert, Hypergraph Mert, and Pro are averaged over 3 runs.
Table 2: Bleu on IWSLT09 (dev: IWSLT04, test: IWSLT05).

| algorithm | # feats | dev (Bleu) | test (Bleu) |
|---|---|---|---|
| $n$-best Mert | 18 | 44.9 | 47.9 |
| Hypergraph Mert | 18 | 46.6 | 50.7 |
| Pro | 18 | 45.0 | 49.5 |
| local update perc. | 443K | 45.6 | 49.1 |
| MaxForce | 529K | 47.4 | 51.5 |
Table 3: Bleu on FBIS (dev: NIST06 newswire, test: NIST08 newswire).

| algorithm | # feats | dev (Bleu) | test (Bleu) |
|---|---|---|---|
| Hypergraph Mert | 18 | 27.3 | 23.0 |
| Pro | 18 | 26.4 | 22.7 |
| MaxForce | 4.5M | 27.7 | 23.9 |
Table 3 shows the Bleu scores of Hypergraph Mert, Pro, and MaxForce on FBIS. MaxForce activates 4.5M features, and achieves +1.2 Bleu over Pro and +0.9 Bleu over Hypergraph Mert. The training time (on 32 cores) is about 30 minutes on the dev set for Hypergraph Mert and Pro, and about 5 hours on the training set for MaxForce.
We have presented a latent-variable violation-fixing framework for general structured prediction problems with inexact search over hypergraphs. Its application to Hiero brings significant improvements in Bleu compared to algorithms specially designed for MT tuning, such as Mert and Pro.
Part of this work was done during K. Z.'s internship at IBM. We thank Martin Čmejrek and Lemao Liu for discussions, David Chiang for pointing us to pycdec, Dan Gildea for pointing out the LM indexing issue in forced decoding, and the anonymous reviewers for comments. This work is supported by DARPA FA8750-13-2-0041 (DEFT), DARPA HR0011-12-C-0015 (BOLT), and a Google Faculty Research Award.