In this paper we explicitly consider sentence skeleton information for Machine Translation (MT). The basic idea is to translate the key elements of the input sentence using a skeleton translation model, and then to cover the remaining segments using a full translation model. We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data.
Current Statistical Machine Translation (SMT) approaches model the translation problem as a process of generating a derivation of atomic translation units, assuming that every unit is drawn from the same model. The simplest of these is the phrase-based approach [Och et al.1999, Koehn et al.2003], which employs a single global model to process any sub-string of the input sentence. In this way, all we need to do is translate one sequence of source words at a time until the entire sentence is covered. Despite good results on many tasks, such a method ignores the role of each source word and differs from the way human translators work. For example, an important-first strategy is generally adopted in human translation: we translate the key elements/structures (or skeleton) of the sentence first, and then translate the remaining parts. This especially makes sense for languages such as Chinese, where complex structures are common.
Note that source-language structural information has been intensively investigated in recent studies of syntactic translation models. Some of them develop syntax-based models on complete syntactic trees with Treebank annotations [Liu et al.2006, Huang et al.2006, Zhang et al.2008], while others use source-language syntax as soft constraints [Marton and Resnik2008, Chiang2010]. However, these approaches suffer from the same problem as their phrase-based counterpart: a single global model handles all translation units, no matter whether they come from the skeleton of the input tree/sentence or from less important sub-structures.
In this paper we instead explicitly model the translation problem with sentence skeleton information. In particular,

- We develop a skeleton-based model which divides translation into two sub-models: a skeleton translation model (which translates the key elements) and a full translation model (which translates the remaining source words and generates the complete translation).
- We develop a skeletal language model to describe the well-formedness of a translation skeleton and to capture some of the long-distance word dependencies.
- We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
The first issue that arises is how to identify the skeleton of a given source sentence. Many approaches are possible. For example, we could start with a full syntactic tree and transform it into a simpler form (e.g., by removing sub-trees). Here we choose a simple and straightforward method: a skeleton is obtained by dropping all unimportant words of the original sentence while preserving grammaticality. See the following example skeleton of a Chinese sentence.
Original Sentence (subscripts represent word indices; each Chinese word is followed by its English gloss):

每(1)/per 吨(2)/ton 海水淡化(3)/seawater desalination 处理(4)/treatment 的(5)/of 成本(6)/the cost 在(7) 5(8)/5 元(9)/yuan 的(10)/of 基础(11)/from 上(12) 进一步(13)/has been further 下降(14)/reduced 。(15)/.

(The cost of seawater desalination treatment has been further reduced from 5 yuan per ton.)
Sentence Skeleton (subscripts are the word indices of the original sentence):

成本(6)/the cost 进一步(13)/has been further 下降(14)/reduced 。(15)/.

(The cost has been further reduced.)
Obviously the skeleton used in this work can be viewed as a simplified sentence. Thus the problem is in principle the same as sentence simplification/compression. The motivations for defining the problem in this way are two-fold. First, as the skeleton is a well-formed (but simple) sentence, all current MT approaches are applicable to the skeleton translation problem. Second, obtaining simplified sentences by word deletion is a well-studied issue [Knight and Marcu2000, Clarke and Lapata2006, Galley and McKeown2007, Cohn and Lapata2008, Yamangil and Shieber2010, Yoshikawa et al.2012], and many good sentence simplification/compression methods are available to our work. Due to space limitations, we do not go deeper into this problem. In Section 3.1 we describe the corpus and system employed for automatic generation of sentence skeletons.
Next we describe our approach to integrating skeleton information into MT models. We start with the assumption that the 1-best skeleton $\tau$ is provided by the skeleton identification system. We then define skeleton-based translation as the task of searching for the best target string $\hat{t}$ given the source string $s$ and its skeleton $\tau$:
$$\hat{t} \;=\; \operatorname*{arg\,max}_{t} \Pr(t \mid s, \tau) \qquad (1)$$
As is standard in SMT, we further assume that 1) the translation process can be decomposed into a derivation of phrase-pairs (for phrase-based models) or translation rules (for syntax-based models); and 2) a linear function is used to assign a model score to each derivation. Let $d(s, \tau, t)$ (or $d$ for short) denote a translation derivation. The above problem can then be redefined in a Viterbi fashion: we find the derivation $\hat{d}$ with the highest model score given $s$ and $\tau$:
$$\hat{d} \;=\; \operatorname*{arg\,max}_{d} \mathrm{score}(d; s, \tau) \qquad (2)$$
In this way, the MT output can be regarded as the target string encoded in $\hat{d}$.
To compute $\mathrm{score}(d)$, we use a linear combination of a skeleton translation model $g_{\mathrm{skel}}(d)$ and a full translation model $g_{\mathrm{full}}(d)$:
$$\mathrm{score}(d) \;=\; g_{\mathrm{skel}}(d) + g_{\mathrm{full}}(d) \qquad (3)$$
where the skeleton translation model $g_{\mathrm{skel}}(d)$ handles the translation of the sentence skeleton, while the full translation model $g_{\mathrm{full}}(d)$ is the baseline model and handles the original problem of translating the whole sentence. The motivation here is straightforward: we use an additional score to model the skeleton translation problem and interpolate it with the baseline model. See Figure 1 for an example of applying the above model to phrase-based MT. In the figure, each source phrase is translated into a target phrase, represented by linked rectangles. The skeleton translation model focuses on the translation of the sentence skeleton, i.e., the solid (red) rectangles, while the full translation model computes the model score over all phrase-pairs, i.e., all solid and dashed rectangles.
A further note on the model: Eq. (3) provides a very flexible way to choose models. While we restrict ourselves to phrase-based translation in the following description and experiments, different models/features can be chosen for $g_{\mathrm{skel}}(d)$ and $g_{\mathrm{full}}(d)$. E.g., one may introduce syntactic features into $g_{\mathrm{skel}}(d)$ due to their good ability to capture structural information, and employ a standard phrase-based model for $g_{\mathrm{full}}(d)$, in which not all segments of the sentence need to respect syntactic constraints.
In this work both the skeleton translation model and the full translation model resemble the usual forms used in phrase-based MT, i.e., the model score is computed by a linear combination of a group of phrase-based features and language models. In phrase-based MT, the translation problem is modeled by a derivation of phrase-pairs. Given a translation model $m$, a language model $lm$ and a vector of feature weights $w = (w_m, w_{lm})$, the model score of a derivation $d$ is computed by
$$g(d; w, m, lm) \;=\; w_m \cdot f_m(d) \;+\; w_{lm} \cdot \log lm(d) \qquad (4)$$
where $f_m(d)$ is a vector of feature values defined on $d$, and $w_m$ is the corresponding weight vector; $\log lm(d)$ and $w_{lm}$ are the score and weight of the language model, respectively.
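For concreteness, the following is a minimal Python sketch of the linear model score in Eq. (4), under the assumption that a derivation's feature values are given as a name-to-value map and the language-model score is supplied as a log-probability. All names here are illustrative, not taken from any particular toolkit.

```python
# Minimal sketch of the linear model score in Eq. (4).
# All names (feature names, lm_logprob) are illustrative assumptions.

def linear_model_score(derivation_features, w_m, lm_logprob, w_lm):
    """g(d; w, m, lm) = w_m . f_m(d) + w_lm * log lm(d)."""
    # Dot product of the phrase-based feature values and their weights.
    translation_score = sum(w_m[name] * value
                            for name, value in derivation_features.items())
    # Weighted n-gram language model score (a log-probability).
    return translation_score + w_lm * lm_logprob

# Example: a toy derivation with two phrase-based features.
features = {"phrase_trans_prob": -2.3, "lexical_weight": -1.7}
weights = {"phrase_trans_prob": 1.0, "lexical_weight": 0.5}
print(linear_model_score(features, weights, lm_logprob=-12.4, w_lm=0.8))
```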
To ease modeling, we only consider skeleton-consistent derivations in this work. A derivation $d$ is skeleton-consistent if no phrase in $d$ crosses a skeleton boundary (e.g., a phrase where two of the source words are in the skeleton and one is outside). Obviously, from any skeleton-consistent derivation $d$ we can extract a skeleton derivation $d_{\mathrm{skel}}$ which covers the sentence skeleton exactly. For example, in Figure 1 the derivation of phrase-pairs is skeleton-consistent, and the skeleton derivation is formed by the phrase-pairs that cover the sentence skeleton (the solid rectangles). A sketch of the consistency check follows.
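Skeleton consistency can be checked directly from the source positions each phrase covers: a derivation is consistent iff every phrase covers only skeleton positions or only non-skeleton positions. The following Python sketch uses a hypothetical data structure (a phrase as a set of covered source positions), not the actual decoder's representation.

```python
# Sketch: check skeleton-consistency of a derivation and extract its
# skeleton derivation. A phrase is represented here as the set of
# source positions it covers; this data structure is illustrative only.

def is_skeleton_consistent(derivation, skeleton_positions):
    """True if no phrase covers both skeleton and non-skeleton source words."""
    for phrase in derivation:
        covered = phrase["src_positions"]
        in_skel = covered & skeleton_positions
        # A mixed phrase crosses a skeleton boundary.
        if in_skel and in_skel != covered:
            return False
    return True

def skeleton_derivation(derivation, skeleton_positions):
    """The phrases of a skeleton-consistent derivation covering the skeleton."""
    return [p for p in derivation if p["src_positions"] <= skeleton_positions]

# Toy example: the skeleton covers source positions {0, 3, 4}.
skel = {0, 3, 4}
d = [{"src_positions": frozenset({0})},     # skeleton phrase
     {"src_positions": frozenset({1, 2})},  # non-skeleton phrase
     {"src_positions": frozenset({3, 4})}]  # skeleton phrase
assert is_skeleton_consistent(d, skel)
print(len(skeleton_derivation(d, skel)))  # -> 2
```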
Then, we can simply define $g_{\mathrm{skel}}(d)$ and $g_{\mathrm{full}}(d)$ as the model scores of $d_{\mathrm{skel}}$ and $d$, respectively:
$$g_{\mathrm{skel}}(d) \;\equiv\; g(d_{\mathrm{skel}}; w_{\mathrm{skel}}, m, lm_{\mathrm{skel}}) \qquad (5)$$
$$g_{\mathrm{full}}(d) \;\equiv\; g(d; w_{\mathrm{full}}, m, lm) \qquad (6)$$
This model makes skeleton translation and full translation much simpler, because both work in the usual manner of string translation in phrase-based MT. $g_{\mathrm{skel}}(\cdot)$ and $g_{\mathrm{full}}(\cdot)$ share the same translation model $m$, which can easily be learned from the bilingual data.¹ On the other hand, each sub-model has its own feature weight vector (i.e., $w_{\mathrm{skel}}$ and $w_{\mathrm{full}}$).

¹ In $g_{\mathrm{skel}}(\cdot)$, we compute the reordering model score on the skeleton, though the model is learned from the full sentences. In this way the reordering problems in skeleton translation and full translation are distinguished and handled separately.
For language modeling, $lm$ is the standard $n$-gram language model adopted in the baseline system. $lm_{\mathrm{skel}}$ is a skeletal language model for estimating the well-formedness of the translation skeleton. Here a translation skeleton is a target string in which all non-skeleton segments are generalized to a symbol X. E.g., in Figure 1 the translation skeleton is 'the cost X has been further reduced X .', where the two Xs represent non-skeleton segments of the translation. With this string representation, the skeletal language model can be implemented as a standard $n$-gram language model, that is, a string probability is calculated as a product of a sequence of $n$-gram probabilities (over normal words and X). To learn the skeletal language model, we replace the non-skeleton parts of the target sentences in the bilingual corpus with Xs, using the source sentence skeletons and word alignments. The skeletal language model is then trained on these generalized strings in the standard way of $n$-gram language modeling.
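As an illustration of this generalization step, the following Python sketch projects a source skeleton onto the target side through word alignments and collapses each maximal non-skeleton target segment into a single X. The 0-indexed positions, the alignment format, and the toy alignments themselves are assumptions made for the sketch.

```python
# Sketch: build a skeletal-LM training string from a target sentence,
# the source-side skeleton positions, and word alignments given as
# (source_pos, target_pos) pairs. Positions are 0-indexed; all names
# are illustrative assumptions, not part of any toolkit.

def skeletal_lm_string(target_words, skeleton_src_positions, alignments):
    # Target positions aligned to any skeleton source word are kept as-is.
    keep = {t for s, t in alignments if s in skeleton_src_positions}
    out, prev_was_x = [], False
    for i, word in enumerate(target_words):
        if i in keep:
            out.append(word)
            prev_was_x = False
        elif not prev_was_x:
            # Collapse a maximal run of non-skeleton words into one X.
            out.append("X")
            prev_was_x = True
    return " ".join(out)

# Toy example mirroring the paper's running example (alignments assumed).
tgt = ("the cost of seawater desalination treatment "
       "has been further reduced from 5 yuan per ton .").split()
skel_src = {5, 12, 13, 14}  # 成本, 进一步, 下降, 。 in the source sentence
align = [(5, 0), (5, 1), (12, 6), (12, 7), (12, 8), (13, 9), (14, 15)]
print(skeletal_lm_string(tgt, skel_src, align))
# -> "the cost X has been further reduced X ."
```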
By substituting Eq. (4) into Eqs. (5) and (6), and then into Eqs. (3) and (2), we obtain the final model used in this work:
$$\hat{d} \;=\; \operatorname*{arg\,max}_{d} \Big( w_{\mathrm{skel},m} \cdot f_m(d_{\mathrm{skel}}) + w_{\mathrm{skel},lm} \cdot \log lm_{\mathrm{skel}}(d_{\mathrm{skel}}) + w_{\mathrm{full},m} \cdot f_m(d) + w_{\mathrm{full},lm} \cdot \log lm(d) \Big) \qquad (7)$$
Figure 1 shows the translation process and the associated model scores for the example sentence. Note that this method does not require any new translation model for implementation. Given a baseline phrase-based system, all we need is to learn the feature weights $w_{\mathrm{skel}}$ and $w_{\mathrm{full}}$ on the development set (with source-language skeleton annotation) and to train the skeletal language model on the target-language side of the bilingual corpus. To implement Eq. (7), we can perform standard decoding while "doubly weighting" the phrases which cover a skeletal section of the sentence, and combining the two language models and the translation model in a linear fashion.
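Putting the pieces together, a sketch of the Eq. (7) score might look as follows, reusing the hypothetical helpers from the earlier sketches. It makes the "double weighting" explicit: phrases in the skeleton derivation are scored a second time under the skeleton weight vector.

```python
# Sketch of the final score in Eq. (7): the full derivation is scored
# under the full model, and its skeleton derivation is scored once more
# under the skeleton model, so skeleton phrases are "doubly weighted".
# Reuses the hypothetical helpers sketched above.

def sbmt_score(derivation, skeleton_positions,
               feats, w_skel, w_full,
               skel_lm_logprob, w_lm_skel,
               lm_logprob, w_lm):
    # Extract the skeleton derivation d_skel from the full derivation d.
    d_skel = skeleton_derivation(derivation, skeleton_positions)
    # g_skel(d): skeleton features + skeletal LM, as in Eq. (5).
    g_skel = linear_model_score(feats(d_skel), w_skel,
                                skel_lm_logprob(d_skel), w_lm_skel)
    # g_full(d): full features + standard n-gram LM, as in Eq. (6).
    g_full = linear_model_score(feats(derivation), w_full,
                                lm_logprob(derivation), w_lm)
    return g_skel + g_full  # Eq. (3)
```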
Table 1: Case-insensitive IBM-version BLEU and TER scores of different systems. dev-skel/test-skel indicate whether the skeletons of the development and test sets were obtained manually or automatically.

| system | dev-skel | test-skel | MT06 (Dev) BLEU | TER | MT04 BLEU | TER | MT05 BLEU | TER | All BLEU | TER |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline | - | - | 35.06 | 60.54 | 38.53 | 61.15 | 34.32 | 62.82 | 36.64 | 61.54 |
| SBMT | manual | manual | 35.71 | 59.60 | 38.99 | 60.67 | 35.35 | 61.60 | 37.30 | 60.73 |
| SBMT | manual | auto | 35.72 | 59.62 | 38.75 | 61.16 | 35.02 | 62.20 | 37.03 | 61.19 |
| SBMT | auto | auto | 35.57 | 59.66 | 39.21 | 60.59 | 35.29 | 61.89 | 37.33 | 60.80 |
| SBMT (w/o skeletal LM) | auto | auto | 35.23 | 60.17 | 38.86 | 60.78 | 34.82 | 62.46 | 36.99 | 61.16 |
| SBMT (w/o skeleton TM) | auto | auto | 35.50 | 59.69 | 39.00 | 60.69 | 35.10 | 62.03 | 37.12 | 60.90 |
| s-space | - | - | 35.00 | 60.50 | 38.39 | 61.20 | 34.33 | 62.90 | 36.57 | 61.58 |
| s-feat. | - | - | 35.16 | 60.50 | 38.60 | 61.17 | 34.25 | 62.88 | 36.70 | 61.58 |
We experimented with our approach on Chinese-English translation using the NiuTrans open-source MT toolkit [Xiao et al.2012]. Our bilingual corpus consists of 2.7M sentence pairs. All sentences were word-aligned using the GIZA++ system and the "grow-diag-final-and" heuristic. A 5-gram language model was trained on the Xinhua portion of the English Gigaword corpus in addition to the target side of the bilingual data; this language model was used in both the baseline and our improved systems. For the skeletal language model, we trained a 5-gram language model on the target side of the bilingual data with non-skeleton segments generalized to Xs. We used the newswire portion of the NIST MT06 evaluation data as our development set, and the evaluation data of MT04 and MT05 as our test sets. We chose the default feature set of the NiuTrans.Phrase engine for the baseline, including phrase translation probabilities, lexical weights, a 5-gram language model, word and phrase bonuses, and a maximum-entropy-based lexicalized reordering model. All feature weights were learned using minimum error rate training [Och2003].
Our skeleton identification system was built using the t3 toolkit², which implements a state-of-the-art sentence simplification model. We used the NEU Chinese sentence simplification (NEUCSS) corpus as our training data [Zhang et al.2013]. It contains skeleton annotations on the Chinese-language side of the Penn Parallel Chinese-English Treebank (LDC2003E07). We trained our system on Parts 1-8 of the NEUCSS corpus and obtained a 65.2% relational F1 score and a 63.1% compression rate on the held-out test set (Part 10). For comparison, we also manually annotated the MT development and test data with skeleton information according to the annotation standard provided with NEUCSS.

² http://staffwww.dcs.shef.ac.uk/people/T.Cohn/t3/
Table 1 shows the case-insensitive IBM-version BLEU and TER scores of the different systems. We see, first of all, that the MT system benefits from our approach in most cases. With both manual and automatic identification of sentence skeletons (rows 2 and 4), there is a significant improvement on the "All" data set. However, using different skeleton identification results for tuning and decoding (row 3) does not yield a big improvement, due to this data inconsistency.
Another interesting question is whether the skeletal language model really contributes to the improvements. To investigate this, we removed the skeletal language model from our skeleton-based translation system (with automatic skeleton identification on both the development and test sets). As seen in row 5 of Table 1, the removal of the skeletal language model results in a significant drop in both BLEU and TER performance, indicating that this language model is very beneficial to our system. For comparison, we also removed the skeleton-based translation model from our system. Row 6 of Table 1 shows that the skeleton-based translation model contributes to the overall improvement as well, though there is no big difference between the baseline and the system without it.
Apart from showing the effects of the skeleton-based model, we also studied the behavior of the MT system under different search-space settings. Row s-space of Table 1 shows the BLEU and TER results of restricting the baseline system to the space of skeleton-consistent derivations, i.e., we removed both the skeleton-based translation model and the skeletal language model from the SBMT system. We see that the restricted search space is slightly harmful to the baseline system. Further, we introduced skeleton-consistency as an indicator feature in the baseline system. As seen in row s-feat., this feature does not yield promising improvements either. These results indicate that the real improvements are due to the skeleton-based model/features used in this work, rather than to the "well-formed" derivations.
Skeleton is a concept that has been used in several sub-areas of MT for years. For example, in confusion-network-based system combination it refers to the backbone hypothesis used to build confusion networks [Rosti et al.2007, Rosti et al.2008]; Liu et al.2011 regard a skeleton as a sentence shortened by removing some of the function words, for better word deletion. In contrast, we define the sentence skeleton as the key segments of a sentence and develop a new MT approach based on this information.
There are some previous studies on the use of sentence skeletons or related information in MT [Mellebeek et al.2006a, Mellebeek et al.2006b, Owczarzak et al.2006]. In spite of their good ideas of using skeleton information, these studies did not model the skeleton-based translation problem within modern SMT pipelines. Our work is a further step towards the use of sentence skeletons in MT. More importantly, we develop a complete approach to this issue and show its effectiveness in a state-of-the-art MT system.
We have presented a simple but effective approach to integrating sentence skeleton information into a phrase-based system. The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. In future work we plan to investigate methods of integrating both syntactic models (for skeleton translation) and phrasal models (for full translation) into our system. We also plan to study more sophisticated reordering models for skeleton translation, rather than reusing the baseline reordering model learned on full sentences.
This work was supported in part by the National Science Foundation of China (Grants 61272376 and 61300097), and the China Postdoctoral Science Foundation (Grant 2013M530131). The authors would like to thank the anonymous reviewers for their pertinent and insightful comments.