This study proposes a word alignment model based on a recurrent neural network (RNN), in which an unlimited alignment history is represented by recurrently connected hidden layers. We perform unsupervised learning using noise-contrastive estimation [15, 26], which utilizes artificially generated negative samples. Our alignment model is directional, similar to the generative IBM models [4]; to overcome this limitation, we encourage agreement between the two directional models by introducing a penalty function that ensures word embedding consistency across the two directional models during training. The RNN-based model outperforms the feed-forward neural network-based model [40] as well as the IBM Model 4 on Japanese-English and French-English word alignment tasks, and achieves comparable translation performance to those baselines on Japanese-to-English and Chinese-to-English translation tasks.
Automatic word alignment is an important task for statistical machine translation. The most classical approaches are the probabilistic IBM models 1-5 [4] and the HMM model [39]. Various studies have extended those models. Yang et al. [40] adapted the Context-Dependent Deep Neural Network for HMM (CD-DNN-HMM) [7], a type of feed-forward neural network (FFNN)-based model, to the HMM alignment model and achieved state-of-the-art performance. However, the FFNN-based model assumes a first-order Markov dependence for alignments.
Recurrent neural network (RNN)-based models have recently demonstrated state-of-the-art performance, outperforming FFNN-based models on various tasks [24, 25, 1, 16, 34]. An RNN has a hidden layer with recurrent connections that propagates its own previous signals. Through this recurrent architecture, RNN-based models have the inherent property of modeling long-span dependencies, e.g., long contexts, in input data. We assume that this property fits the word alignment task, and we propose an RNN-based word alignment model. Our model can maintain and arbitrarily integrate an alignment history, e.g., bilingual context, that is longer than the history available to the FFNN-based model.
The NN-based alignment models are supervised models. Unfortunately, it is usually difficult to prepare word-by-word aligned bilingual data. Yang et al. [40] trained their model from word alignments produced by traditional unsupervised probabilistic models. However, with this approach, errors induced by the probabilistic models are learned as correct alignments; thus, generalization capabilities are limited. To solve this problem, we apply noise-contrastive estimation (NCE) [15, 26] for unsupervised training of our RNN-based model without gold standard alignments or pseudo-oracle alignments. NCE artificially generates bilingual sentences through sampling as pseudo-negative samples, and then trains the model such that the scores of the original bilingual sentences are higher than those of the sampled bilingual sentences.
Our RNN-based alignment model has a direction, like other alignment models, i.e., from $f$ (source language) to $e$ (target language) and from $e$ to $f$. It has been shown that this limitation may be overcome by encouraging two directional models to agree by training them concurrently [22, 21, 14, 11]. The motivation is that the model and generalization errors of the two models differ, and thus the models can complement each other. Based on this motivation, our directional models are also trained simultaneously. Specifically, our training encourages word embeddings to be consistent across alignment directions by introducing into the objective function a penalty term that expresses the difference between the word embeddings of the two models. This constraint prevents each model from overfitting to a particular direction and leads to global optimization across alignment directions.
This paper presents evaluations on Japanese-English and French-English word alignment tasks and Japanese-to-English and Chinese-to-English translation tasks. The results illustrate that our RNN-based model outperforms the FFNN-based model (up to +0.0792 F1-measure) and the IBM Model 4 (up to +0.0703 F1-measure) on the word alignment tasks. For the translation tasks, our model achieves up to a 0.74% gain in BLEU compared to the FFNN-based model, which matches the translation quality of the IBM Model 4.
Various word alignment models have been proposed. These models are roughly clustered into two groups: generative models, such as those proposed by Brown et al. [4], Vogel et al. [39], and Och and Ney [28], and discriminative models, such as those proposed by Taskar et al. [36], Moore [27], and Blunsom and Cohn [3].
Given a source language sentence $f_1^J = f_1, \ldots, f_J$ and a target language sentence $e_1^I = e_1, \ldots, e_I$, $f_1^J$ is generated by $e_1^I$ via the alignment $a_1^J = a_1, \ldots, a_J$. Each $a_j$ is a hidden variable indicating that the source word $f_j$ is aligned to the target word $e_{a_j}$. Usually, a “null” word $e_0$ is added to the target language sentence and $a_1^J$ may contain $a_j = 0$, which indicates that $f_j$ is not aligned to any target word. The probability of generating the sentence $f_1^J$ from $e_1^I$ is defined as
$p(f_1^J \mid e_1^I) = \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I).$   (1)
The IBM Models 1 and 2 and the HMM model decompose it into an alignment probability and a lexical translation probability as
$p(f_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} p_a(a_j \mid a_{j-1}, j)\, p_t(f_j \mid e_{a_j}).$   (2)
The three models differ in their definition of the alignment probability $p_a$. For example, the HMM model uses an alignment probability with a first-order Markov property that depends only on the jump width: $p_a(a_j \mid a_{j-1}, j) = p_a(a_j - a_{j-1})$. In addition, the IBM Models 3-5 are extensions of these, which consider the fertility and distortion of each translated word.
These models are trained using the expectation-maximization algorithm [8] from bilingual sentences without word-level alignments (unlabeled training data). Given a specific model, the best alignment (Viterbi alignment) of the sentence pair $(f_1^J, e_1^I)$ can be found as
$\hat{a}_1^J = \operatorname*{argmax}_{a_1^J} p(f_1^J, a_1^J \mid e_1^I).$   (3)
For example, the HMM model identifies the Viterbi alignment using the Viterbi algorithm.
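For concreteness, the following is a minimal sketch (ours, not the authors' implementation) of Viterbi decoding for an HMM-style alignment model, assuming the alignment and lexical translation probabilities of Eq. 2 are given as dense arrays and ignoring the null word:

```python
import numpy as np

def viterbi_alignment(p_align, p_lex):
    """Viterbi decoding for an HMM-style alignment model (Eqs. 2 and 3).

    p_align[i_prev, i]: probability of jumping from target position i_prev to i
    p_lex[j, i]:        probability that source word j translates target word i
    Returns the most probable alignment a_1 .. a_J (one target index per source word).
    """
    J, I = p_lex.shape
    delta = np.zeros((J, I))             # best log-score of partial alignments ending at (j, i)
    back = np.zeros((J, I), dtype=int)   # backpointers for path recovery

    delta[0] = np.log(p_lex[0]) + np.log(1.0 / I)        # uniform initial alignment
    for j in range(1, J):
        # score of reaching target position i at step j from every previous position
        trans = delta[j - 1][:, None] + np.log(p_align)  # shape (I, I)
        back[j] = trans.argmax(axis=0)
        delta[j] = trans.max(axis=0) + np.log(p_lex[j])

    # backtrack the best path
    a = [int(delta[-1].argmax())]
    for j in range(J - 1, 0, -1):
        a.append(int(back[j, a[-1]]))
    return list(reversed(a))

# toy example: 3 source words, 2 target words
p_align = np.array([[0.7, 0.3],
                    [0.4, 0.6]])
p_lex = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
print(viterbi_alignment(p_align, p_lex))  # -> [0, 0, 0]
```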
As an instance of discriminative models, we describe an FFNN-based word alignment model [40], which is our baseline. An FFNN learns a hierarchy of nonlinear features that can automatically capture complex statistical patterns in input data. Recently, FFNNs have been applied successfully to several tasks, such as speech recognition [7], statistical machine translation [20, 38], and other popular natural language processing tasks [6, 5].
Yang et al. [40] have adapted a type of FFNN, i.e., CD-DNN-HMM [7], to the HMM alignment model. Specifically, the lexical translation and alignment probability in Eq. 2 are computed using FFNNs as
$s_{NN}(a_1^J \mid f_1^J, e_1^I) = \prod_{j=1}^{J} t_{lex}(f_j, e_{a_j} \mid c(f_j), c(e_{a_j})) \cdot t_a(a_j, a_{j-1} \mid c(e_{a_{j-1}})),$   (4)
where $t_a$ and $t_{lex}$ are an alignment score and a lexical translation score, respectively, $s_{NN}$ is a score of the alignments $a_1^J$, and $c(w)$ denotes a context of the word $w$. Note that the model uses nonprobabilistic scores rather than probabilities because normalization over all words is computationally expensive. The model finds the Viterbi alignment using the Viterbi algorithm, similar to the classic HMM model. Note that alignments in the FFNN-based model are also governed by first-order Markov dynamics because the alignment score depends only on the previous alignment $a_{j-1}$.
Figure 1 shows the network structure with one hidden layer for computing a lexical translation score $t_{lex}$. The model consists of a lookup layer, a hidden layer, and an output layer, each of which has a weight matrix. The model receives a source and a target word with their contexts as inputs, which are words in a predefined window (the window size is three in Figure 1). First, the lookup layer converts each input word into its word embedding by looking up its corresponding column in the embedding matrix $L$, and then concatenates the embeddings. Let $V_f$ (or $V_e$) be the set of source words (or target words) and $M$ be a predetermined embedding length; $L$ is an $M \times (|V_f| + |V_e|)$ matrix¹. Word embeddings are dense, low-dimensional, real-valued vectors that can capture syntactic and semantic properties of words [2]. The concatenation ($z_0$) is then fed to the hidden layer to capture nonlinear relations. Finally, the output layer receives the output of the hidden layer ($z_1$) and computes a lexical translation score.

¹ We add special tokens ⟨unk⟩ to handle unknown words and ⟨null⟩ to handle null alignments to $V_f$ and $V_e$.
The computations in the hidden and output layers are as follows²:

² Consecutive hidden layers can be used: $z_n = f(H_n \times z_{n-1} + B_{H_n})$. For simplicity, this paper describes the model with one hidden layer.
$z_1 = f(H \times z_0 + B_H),$   (5)
$t_{lex} = O \times z_1 + B_O,$   (6)
where $H$, $B_H$, $O$, and $B_O$ are $|z_1| \times |z_0|$, $|z_1| \times 1$, $1 \times |z_1|$, and $1 \times 1$ matrices, respectively, and $f(x)$ is an activation function. Following Yang et al. [40], a “hard” version of the hyperbolic tangent, htanh($x$)³, is used as $f(x)$ in this study.

³ htanh($x$) $= -1$ for $x < -1$, htanh($x$) $= 1$ for $x > 1$, and htanh($x$) $= x$ otherwise.
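As an illustration of Eqs. 5 and 6, the following is a minimal numpy sketch of the lexical translation scoring path (lookup, hidden, and output layers with htanh). The shapes follow the text above, but the vocabulary sizes, random initialization, and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, window, hidden = 30, 3, 100             # embedding length, context window, hidden units
V_f, V_e = 1000, 1000                      # toy vocabulary sizes (assumption)

L = rng.normal(size=(M, V_f + V_e)) * 0.1  # lookup (embedding) matrix
H = rng.normal(size=(hidden, 2 * window * M)) * 0.1
B_H = np.zeros(hidden)
O = rng.normal(size=(1, hidden)) * 0.1
B_O = np.zeros(1)

def htanh(x):
    return np.clip(x, -1.0, 1.0)           # "hard" hyperbolic tangent

def t_lex(src_window_ids, tgt_window_ids):
    """Lexical translation score for a source and a target word with their contexts."""
    # lookup layer: concatenate the embeddings of all window words (z_0)
    z0 = np.concatenate([L[:, i] for i in src_window_ids] +
                        [L[:, V_f + i] for i in tgt_window_ids])
    z1 = htanh(H @ z0 + B_H)               # hidden layer, Eq. 5
    return (O @ z1 + B_O).item()           # output layer, Eq. 6

print(t_lex([3, 17, 42], [5, 8, 99]))      # one (source, target) window pair
```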
The alignment model based on an FFNN is formed in the same manner as the lexical translation model. Each model is optimized by minimizing the following ranking loss with a margin using stochastic gradient descent (SGD)⁴, where gradients are computed by the back-propagation algorithm [31]:

⁴ In our experiments, we used a mini-batch SGD instead of a plain SGD.
$\mathrm{loss}(\theta) = \sum_{(f, e) \in T} \max\{0,\; 1 - s_\theta(a^+ \mid f, e) + s_\theta(a^- \mid f, e)\},$   (7)
where $\theta$ denotes the weights of the layers in the model, $T$ is a set of training data, $a^+$ is the gold standard alignment, $a^-$ is the incorrect alignment with the highest score under $\theta$, and $s_\theta$ denotes the score defined by Eq. 4 as computed by the model under $\theta$.
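A minimal sketch of the margin-based ranking loss of Eq. 7, assuming a scoring function `score` and a training set in which the highest-scoring incorrect alignment has already been found (e.g., by a constrained Viterbi search); the names are ours:

```python
def ranking_loss(score, training_set, theta):
    """Margin-based ranking loss of Eq. 7.

    training_set: iterable of (f, e, gold_alignment, best_incorrect_alignment) tuples
    score(a, f, e, theta): alignment score s_theta(a | f, e), e.g. Eq. 4
    """
    loss = 0.0
    for f, e, a_gold, a_wrong in training_set:
        # hinge: penalize when the gold alignment does not beat the best
        # incorrect alignment by a margin of at least 1
        loss += max(0.0, 1.0 - score(a_gold, f, e, theta) + score(a_wrong, f, e, theta))
    return loss
```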
This section proposes an RNN-based alignment model, which computes a score $s_{NN}(a_1^J \mid f_1^J, e_1^I)$ for the alignments $a_1^J$ using an RNN:
$s_{NN}(a_1^J \mid f_1^J, e_1^I) = \prod_{j=1}^{J} t_{RNN}(a_j \mid a_1^{j-1}, f_j, e_{a_j}),$   (8)
where $t_{RNN}$ is the score of an alignment $a_j$. The prediction of the $j$-th alignment $a_j$ depends on all preceding alignments $a_1^{j-1}$. Note that the proposed model also uses nonprobabilistic scores, similar to the FFNN-based model.
The RNN-based model is illustrated in Figure 2. The model consists of a lookup layer, a hidden layer, and an output layer, which have weight matrices $L$, $\{H_d, R_d, B_{H_d}\}$, and $\{O, B_O\}$, respectively. Each matrix in the hidden layer ($H_d$, $R_d$, and $B_{H_d}$) depends on the alignment, where $d$ denotes the jump distance from $a_{j-1}$ to $a_j$: $d = a_j - a_{j-1}$. In our experiments, we merge distances that are greater than 8 and less than -8 into the special “$\geq 8$” and “$\leq -8$” distances, respectively. Specifically, the hidden layer has the weight matrices $\{H_d, R_d, B_{H_d}\}$ for each $d \in \{\leq{-8}, -7, \ldots, -1, 0, 1, \ldots, 7, \geq 8\}$ and computes $y_j$ using the matrices corresponding to the jump distance $d$.
The Viterbi alignment is determined using the Viterbi algorithm, similar to the FFNN-based model, where the model is sequentially applied from $f_1$ to $f_J$⁵. When computing the score of the alignment between $f_j$ and $e_{a_j}$, the two words are input to the lookup layer. In the lookup layer, each of these words is converted to its word embedding, and then the concatenation of the two embeddings ($x_j$) is fed to the hidden layer in the same manner as in the FFNN-based model. Next, the hidden layer receives the output of the lookup layer ($x_j$) and that of the previous hidden layer ($y_{j-1}$). The hidden layer then computes and outputs the nonlinear relations between them ($y_j$). Note that the weight matrices used in this computation are those of the specific jump distance $d$. The output of the hidden layer ($y_j$) is copied and fed to the output layer and the next hidden layer. Finally, the output layer computes the score of $a_j$ ($t_{RNN}$) from the output of the hidden layer ($y_j$). Note that the FFNN-based model consists of two components: one for lexical translation and the other for alignment. The proposed RNN instead produces a single score that is constructed in the hidden layer by employing the distance-dependent weight matrices.

⁵ Strictly speaking, we cannot apply the dynamic-programming forward-backward algorithm (i.e., the Viterbi algorithm) due to the long alignment history of $y_j$. Thus, the Viterbi alignment is computed approximately using heuristic beam search.
Specifically, the computations in the hidden and output layer are as follows:
$y_j = f(H_d \times x_j + R_d \times y_{j-1} + B_{H_d}),$   (9)
$t_{RNN} = O \times y_j + B_O,$   (10)
where $H_d$, $R_d$, $B_{H_d}$, $O$, and $B_O$ are $|y_j| \times |x_j|$, $|y_j| \times |y_{j-1}|$, $|y_j| \times 1$, $1 \times |y_j|$, and $1 \times 1$ matrices, respectively. Note that $|y_j| = |y_{j-1}|$. $f(x)$ is an activation function, which is a hard hyperbolic tangent, i.e., htanh($x$), in this study.
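To make Eqs. 9 and 10 concrete, the following is a hedged numpy sketch of the recurrent scoring with distance-dependent weight matrices; the clipping of jump distances to [-8, 8] follows the text, while the initial alignment position, the random weights, and the summation of per-position scores are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, hidden = 30, 100
distances = range(-8, 9)                     # clipped jump distances d = a_j - a_{j-1}

# one (H_d, R_d, B_Hd) triple per clipped jump distance, plus a shared output layer
H = {d: rng.normal(size=(hidden, 2 * M)) * 0.1 for d in distances}
R = {d: rng.normal(size=(hidden, hidden)) * 0.1 for d in distances}
B_H = {d: np.zeros(hidden) for d in distances}
O = rng.normal(size=(1, hidden)) * 0.1
B_O = np.zeros(1)

def htanh(x):
    return np.clip(x, -1.0, 1.0)

def score_alignment(embed_pairs, alignment):
    """Per-position RNN scores t_RNN, combined over j (summed here for illustration).

    embed_pairs[j]: concatenation x_j of the embeddings of f_j and e_{a_j} (length 2M)
    alignment[j]:   target position a_j aligned to source position j
    """
    y_prev = np.zeros(hidden)
    total, a_prev = 0.0, 0                                 # assume the walk starts at position 0
    for x_j, a_j in zip(embed_pairs, alignment):
        d = int(np.clip(a_j - a_prev, -8, 8))              # jump distance, clipped
        y_j = htanh(H[d] @ x_j + R[d] @ y_prev + B_H[d])   # Eq. 9
        total += (O @ y_j + B_O).item()                    # Eq. 10
        y_prev, a_prev = y_j, a_j
    return total

x = [rng.normal(size=2 * M) for _ in range(4)]             # 4 source positions
print(score_alignment(x, [1, 2, 2, 3]))
```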
As described above, the RNN-based model has a hidden layer with recurrent connections. Under the recurrence, the proposed model compactly encodes the entire history of previous alignments in the hidden layer configuration $y_j$. Therefore, the proposed model can find alignments by taking advantage of the long alignment history, while the FFNN-based model considers only the last alignment.
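Because the score at position $j$ depends on the whole alignment prefix through the recurrent state, exact dynamic programming is not applicable (see footnote 5), and decoding falls back to heuristic beam search. A minimal sketch of such a search is given below; the `step_score` interface and hypothesis representation are our assumptions:

```python
def beam_search_alignment(J, I, step_score, beam_width=100):
    """Approximate Viterbi alignment by beam search.

    step_score(j, a_j, state) -> (score, new_state): score of aligning source
    position j to target position a_j given the alignment history encoded in
    state (e.g., the previous hidden layer output and previous alignment).
    """
    beam = [(0.0, [], None)]                    # (cumulative score, alignment, state)
    for j in range(J):
        candidates = []
        for total, alignment, state in beam:
            for a_j in range(I + 1):            # position 0 stands for the "null" word
                s, new_state = step_score(j, a_j, state)
                candidates.append((total + s, alignment + [a_j], new_state))
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:beam_width]          # keep only the best partial alignments
    return beam[0][1]                           # best full alignment found
```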
During training, we optimize the weight matrices of each layer (i.e., $L$, $H_d$, $R_d$, $B_{H_d}$, $O$, and $B_O$) following a given objective using a mini-batch SGD with batch size $D$, which converges faster than a plain SGD ($D = 1$). Gradients are computed by the back-propagation through time algorithm [31], which unfolds the network in time ($j$) and computes gradients over the time steps. In addition, an $\ell_2$ regularization term is added to the objective to prevent the model from overfitting the training data.
The RNN-based model can be trained by a supervised approach, similar to the FFNN-based model, where training proceeds based on the ranking loss defined by Eq. 7 (Section 2.2). However, this approach requires gold standard alignments. To overcome this drawback, we propose an unsupervised method using NCE, which learns from unlabeled training data.
Dyer et al. [9] presented an unsupervised alignment model based on contrastive estimation (CE) [32]. CE seeks to discriminate observed data from its neighborhood, which can be viewed as pseudo-negative samples. Dyer et al. [9] regarded all possible alignments of the bilingual sentences given as training data and those of the full translation search space as the observed data and its neighborhood, respectively.
We introduce this idea to a ranking loss with margin as
$\mathrm{loss}(\theta) = \sum_{(f, e^+) \in T} \max\bigl\{0,\; 1 - E_{\Omega(f, e^+)}[s_\theta(a \mid f, e^+)] + E_{\Omega(f, e^-)}[s_\theta(a \mid f, e^-)]\bigr\},$   (11)
where $\Omega(f, e)$ is the set of all possible alignments given $(f, e)$, $E_{\Omega}$ is the expected value of the scores $s_\theta$ on $\Omega$, $e^+$ denotes a target language sentence in the training data, and $e^-$ denotes a pseudo-target language sentence. The first expectation term is for the observed data, and the second is for the neighborhood.
However, the computation of the expectation over the neighborhood is prohibitively expensive. To reduce computation, we employ NCE, which uses sentences randomly sampled from all target language sentences in $T$ as $e^-$, and calculates the expected values by a beam search with beam width $W$ to truncate alignments with low scores. In our experiments, we set $W$ to 100. In addition, the above criterion is converted to an online fashion as
$\mathrm{loss}(\theta) = \sum_{f^+ \in T} \max\Bigl\{0,\; 1 - E_{GEN(f^+, e^+)}[s_\theta(a \mid f^+, e^+)] + \frac{1}{N} \sum_{e^-} E_{GEN(f^+, e^-)}[s_\theta(a \mid f^+, e^-)]\Bigr\},$   (12)
where $e^+$ is the target language sentence aligned to $f^+$ in the training data, i.e., $(f^+, e^+) \in T$, $e^-$ is a randomly sampled pseudo-target language sentence with length $|e^+|$, and $N$ denotes the number of pseudo-target language sentences per source sentence $f^+$. Note that $|e^-| = |e^+|$. $GEN$ is a subset of all possible word alignments $\Omega$, which is generated by beam search.
In a simple implementation, each $e^-$ is generated by repeating a random sampling from the set of target words ($V_e$) $|e^+|$ times and lining the samples up sequentially. To employ more discriminative negative samples, our implementation samples each word of $e^-$ from the set of target words that co-occur with the words of $f^+$ and whose translation probability is above a threshold under the IBM Model 1 incorporating an $\ell_0$ prior [37]. The IBM Model 1 with the $\ell_0$ prior is convenient for reducing translation candidates because it generates sparser alignments than the standard IBM Model 1.
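A hedged sketch of this sampling scheme: for each source word we keep the target words whose IBM Model 1 translation probability exceeds a threshold, and each word of a pseudo-target sentence is drawn from the union of these candidate sets. The data structures, function names, and the toy translation table are ours:

```python
import random

def build_candidates(ibm1_prob, threshold=0.001):
    """Target words whose IBM Model 1 translation probability exceeds the threshold."""
    return {f: [e for e, p in e_probs.items() if p > threshold]
            for f, e_probs in ibm1_prob.items()}

def sample_negatives(f_sentence, e_length, candidates, n_samples, target_vocab):
    """Draw n_samples pseudo-target sentences of length e_length for source sentence f."""
    pool = sorted({e for f in f_sentence for e in candidates.get(f, [])})
    if not pool:                                  # fall back to the full target vocabulary
        pool = list(target_vocab)
    return [[random.choice(pool) for _ in range(e_length)] for _ in range(n_samples)]

# toy usage with a hypothetical IBM Model 1 translation table
ibm1 = {"watashi": {"i": 0.6, "me": 0.2}, "hon": {"book": 0.7, "a": 0.01}}
cands = build_candidates(ibm1, threshold=0.05)
print(sample_negatives(["watashi", "hon"], e_length=3, candidates=cands,
                       n_samples=2, target_vocab=["i", "book", "a", "me"]))
```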
Both the FFNN-based and RNN-based models are based on the HMM alignment model, and they are therefore asymmetric, i.e., they can represent one-to-many relations from the target side. Asymmetric models are usually trained in each alignment direction. The model proposed by Yang et al. [40] is no exception. However, it has been demonstrated that encouraging directional models to agree improves alignment performance [22, 21, 14, 11].
Inspired by their work, we introduce an agreement constraint to our learning. The constraint concretely enforces agreement in word embeddings of both directions. The proposed method trains two directional models concurrently based on the following objective by incorporating a penalty term that expresses the difference between word embeddings:
$\hat{\theta}_{FE} = \operatorname*{argmin}_{\theta_{FE}} \bigl\{ \mathrm{loss}(\theta_{FE}) + \alpha \,\| \theta^L_{EF} - \theta^L_{FE} \| \bigr\},$   (13)
$\hat{\theta}_{EF} = \operatorname*{argmin}_{\theta_{EF}} \bigl\{ \mathrm{loss}(\theta_{EF}) + \alpha \,\| \theta^L_{FE} - \theta^L_{EF} \| \bigr\},$   (14)
where $\theta_{FE}$ (or $\theta_{EF}$) denotes the weights of the layers in the source-to-target (or target-to-source) alignment model, $\theta^L$ denotes the weights of the lookup layer, i.e., the word embeddings, and $\alpha$ is a parameter that controls the strength of the agreement constraint. $\|\theta\|$ indicates the norm of $\theta$; the 2-norm is used in our experiments. Equations 13 and 14 can be applied to both the supervised and unsupervised approaches: Equations 7 and 12 are substituted into $\mathrm{loss}(\theta)$ in supervised and unsupervised learning, respectively. The proposed constraint penalizes overfitting to a particular direction and enables the two directional models to be optimized globally across alignment directions.
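A minimal sketch of the penalized objectives of Eqs. 13 and 14, assuming the two directional losses are already computed and the lookup-layer weights are available as numpy arrays; only α and the 2-norm follow the text, the rest is illustrative:

```python
import numpy as np

def penalized_objectives(loss_fe, loss_ef, L_fe, L_ef, alpha=0.1):
    """Directional losses with the word-embedding agreement penalty (Eqs. 13 and 14).

    loss_fe, loss_ef: losses of the source-to-target / target-to-source models
                      (Eq. 7 for supervised, Eq. 12 for unsupervised learning)
    L_fe, L_ef:       lookup-layer weights (word embeddings) of the two models
    alpha:            strength of the agreement constraint
    """
    penalty = alpha * np.linalg.norm(L_ef - L_fe)   # 2-norm of the embedding difference
    return loss_fe + penalty, loss_ef + penalty     # minimized w.r.t. each direction

# toy usage
L_fe = np.ones((30, 10))
L_ef = np.full((30, 10), 1.1)
print(penalized_objectives(2.3, 2.7, L_fe, L_ef))
```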
Algorithm 1 Training Algorithm
Input: $\theta_{FE}$, $\theta_{EF}$, training data $T$, translation candidates found by the IBM Model 1 with the $\ell_0$ prior,
batch size $D$, $N$, $MaxIter$, $\alpha$
1: for all $t$ such that $1 \leq t \leq MaxIter$ do
2:   $\{(f^+, e^+)\}_D \leftarrow$ sample($T$, $D$)
3-1:   $\{e^-\} \leftarrow$ neg_sample($f^+$, $N$)
3-2:   $\{f^-\} \leftarrow$ neg_sample($e^+$, $N$)
4-1:   $\theta_{FE} \leftarrow$ update($\theta_{FE}$, $\theta^L_{EF}$, $f^+$, $e^+$, $e^-$)
4-2:   $\theta_{EF} \leftarrow$ update($\theta_{EF}$, $\theta^L_{FE}$, $e^+$, $f^+$, $f^-$)
5: end for
Output: $\theta_{FE}$, $\theta_{EF}$
Our unsupervised learning procedure is summarized in Algorithm 1. In Algorithm 1, line 2 randomly samples $D$ bilingual sentences $(f^+, e^+)$ from the training data $T$. Lines 3-1 and 3-2 generate $N$ pseudo-negative samples for each $f^+$ and $e^+$ based on the translation candidates of $f^+$ and $e^+$ found by the IBM Model 1 with the $\ell_0$ prior (Section 4.1). Lines 4-1 and 4-2 update the weights in each layer following a given objective (Sections 4.1 and 4.2). Note that $\theta_{FE}$ and $\theta_{EF}$ are updated concurrently in each iteration, and $\theta^L_{EF}$ (or $\theta^L_{FE}$) is employed to enforce agreement between word embeddings when updating $\theta_{FE}$ (or $\theta_{EF}$).
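A hedged Python rendering of Algorithm 1, assuming helper functions `sample`, `neg_sample`, and `update` with the roles described above; the signatures and the per-pair update inside the mini-batch are our simplifications:

```python
def train(theta_fe, theta_ef, data, tc_e, tc_f,
          batch_size, n_neg, max_iter, sample, neg_sample, update):
    """Concurrent training of the two directional models (cf. Algorithm 1).

    tc_e, tc_f: translation candidates per source/target word from the
                IBM Model 1 with the l0 prior (used for negative sampling)
    """
    for _ in range(max_iter):
        batch = sample(data, batch_size)                           # line 2: D pairs (f+, e+)
        for f_plus, e_plus in batch:
            e_neg = neg_sample(f_plus, len(e_plus), tc_e, n_neg)   # line 3-1
            f_neg = neg_sample(e_plus, len(f_plus), tc_f, n_neg)   # line 3-2
            # lines 4-1 / 4-2: each direction is updated while the other model's
            # embeddings are passed in so the agreement penalty can be evaluated
            theta_fe = update(theta_fe, theta_ef, f_plus, e_plus, e_neg)
            theta_ef = update(theta_ef, theta_fe, e_plus, f_plus, f_neg)
    return theta_fe, theta_ef
```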
| Corpus | Train | Dev | Test |
|---|---|---|---|
| BTEC | 9 K | 0 | 960 |
| Hansards | 1.1 M | 37 | 447 |
| FBIS | 240 K | 878 | 919 / 1,597 |
| IWSLT | 40 K | 2,501 | 489 |
| NTCIR | 3.2 M | 2,000 | 2,000 |
We evaluated the alignment performance of the proposed models on two tasks: Japanese-English word alignment with the Basic Travel Expression Corpus (BTEC) [35] and French-English word alignment with the Hansards dataset (Hansards) from the 2003 NAACL shared task [23]. In addition, we evaluated the end-to-end translation performance of three tasks: a Chinese-to-English translation task with the FBIS corpus (FBIS), the IWSLT 2007 Japanese-to-English translation task (IWSLT) [10], and the NTCIR-9 Japanese-to-English patent translation task (NTCIR) [13]⁶.

⁶ We did not evaluate the translation performance on the Hansards data because the development data is very small and performance would be unreliable.
Table 1 shows the sizes of our experimental datasets. Note that the development data was not used in the alignment tasks, i.e., BTEC and Hansards, because the hyperparameters of the alignment models were set by preliminary small-scale experiments. The BTEC data is the first 9,960 sentence pairs in the training data for IWSLT, which were annotated with word alignments [12]. We split these pairs into the first 9,000 as training data and the remaining 960 as test data. All the data in BTEC is word-aligned, and the training data in Hansards is unlabeled data. In FBIS, we used the NIST02 evaluation data as the development data, and the NIST03 and NIST04 evaluation data as test data (NIST03 and NIST04).
We evaluated the proposed RNN-based alignment models against two baselines: the IBM Model 4 and the FFNN-based model with one hidden layer. The IBM Model 4 was trained with the previously presented model sequence scheme [28]: $1^5 H^5 3^3 4^3$, i.e., five iterations of the IBM Model 1 followed by five iterations of the HMM model, etc., which is the default setting for GIZA++ (IBM4). For the FFNN-based model, we set the word embedding length $M$ to 30, the number of units of the hidden layer to 100, and the window size of contexts to 5. Hence, $|z_0|$ is 300 ($30 \times 5 \times 2$). Following Yang et al. [40], the FFNN-based model was trained by the supervised approach described in Section 2.2 (FFNN$_s$).
For the RNN-based models, we set $M$ to 30 and the number of units of each recurrent hidden layer to 100. Thus, $|x_j|$ is 60 ($30 \times 2$). The number of units of each layer of the FFNN-based and RNN-based models and $M$ were set through preliminary experiments. To demonstrate the effectiveness of the proposed learning methods, we evaluated four types of RNN-based models: RNN$_s$, RNN$_{s+c}$, RNN$_u$, and RNN$_{u+c}$, where the subscripts $s$ and $u$ denote a supervised and an unsupervised model, respectively, and “$+c$” indicates that the agreement constraint was used.
In training all the models except IBM4, the weights of each layer were initialized first. For the weights of the lookup layer $L$, we preliminarily trained word embeddings for the source and target language from each side of the training data, and then set $L$ to these pretrained word embeddings to avoid falling into local minima. The other weights were randomly initialized. For the pretraining, we used the RNNLM Toolkit⁷ [24] with the default options. We mapped all words that occurred less than five times to the special token ⟨unk⟩. Next, each weight was optimized using the mini-batch SGD, where the batch size $D$ was 100, the learning rate was 0.01, and the $\ell_2$ regularization parameter was 0.1. The training stopped after 50 epochs. The other parameters were set as follows: $W$, $N$, and the translation candidate threshold in the unsupervised learning were 100, 50, and 0.001, respectively, and $\alpha$ for the agreement constraint was 0.1.

⁷ http://www.fit.vutbr.cz/~imikolov/rnnlm/
In the translation tasks, we used the Moses phrase-based SMT systems [17]. All Japanese and Chinese sentences were segmented by ChaSen⁸ and the Stanford Chinese segmenter⁹, respectively. In the training, long sentences with over 40 words were filtered out. Using the SRILM Toolkit [33] with modified Kneser-Ney smoothing, we trained a 5-gram language model on the English side of each training dataset for IWSLT and NTCIR, and a 5-gram language model on the Xinhua portion of the English Gigaword corpus for FBIS. The SMT weighting parameters were tuned by MERT [29] on the development data.

⁸ http://chasen-legacy.sourceforge.jp/
⁹ http://nlp.stanford.edu/software/segmenter.shtml
| Alignment | BTEC | Hansards |
|---|---|---|
| IBM4 | 0.4859 | 0.9029 |
| FFNN$_s$ (I) | 0.4770 | 0.9020 |
| RNN$_s$ (I) | 0.5053 | 0.9068 |
| RNN$_{s+c}$ (I) | 0.5174 | 0.9202 |
| RNN$_u$ | 0.5307 | 0.9037 |
| RNN$_{u+c}$ | 0.5562 | 0.9275 |
| FFNN$_s$ (R) | 0.8224 | - |
| RNN$_s$ (R) | 0.8798 | - |
| RNN$_{s+c}$ (R) | 0.8921 | - |
Table 2 shows the alignment performance by the F1-measure. Hereafter, (R) and (I) denote models trained from gold standard alignments and from word alignments found by the IBM Model 4, respectively. In Hansards, all models were trained from 100 K data randomly sampled from the training data¹⁰. We evaluated the word alignments produced by first applying each model in both directions and then combining the alignments using the “grow-diag-final-and” heuristic [18]. The significance test on word alignment performance was performed by the sign test with a 5% significance level. “+” in Table 2 indicates that the comparison is significant over the corresponding baselines, IBM4 and FFNN$_s$ (I).

¹⁰ Due to high computational cost, we did not use all the training data. Scaling up to larger datasets will be addressed in future work.
In Table 2, RNN$_{u+c}$, which includes all our proposals, i.e., the RNN-based model, the unsupervised learning, and the agreement constraint, achieves the best performance for both BTEC and Hansards. The differences from the baselines are statistically significant.
Table 2 shows that RNN$_s$ outperforms FFNN$_s$, and the difference is statistically significant in BTEC. These results demonstrate that capturing the long alignment history in the RNN-based model improves the alignment performance. We discuss the difference in the RNN-based model’s effectiveness between language pairs in Section 6.1. Table 2 also shows that RNN$_{s+c}$ and RNN$_{u+c}$ achieve significantly better performance than RNN$_s$ and RNN$_u$, respectively, in both tasks. This indicates that the proposed agreement constraint is effective for training better models in both the supervised and unsupervised approaches.
In BTEC, RNN$_u$ and RNN$_{u+c}$ significantly outperform RNN$_s$ (I) and RNN$_{s+c}$ (I), respectively, and the performance of these models is comparable in Hansards. This indicates that our unsupervised learning benefits our models because the supervised models are adversely affected by errors in the automatically generated training data. This is especially true when the quality of the training data, i.e., the performance of IBM4, is low.
| Alignment | IWSLT | NTCIR | NIST03 | NIST04 |
|---|---|---|---|---|
| IBM4$_{all}$ | 46.47 | 27.91 | 25.90 | 28.34 |
| IBM4 | - | 27.25 | 25.41 | 27.65 |
| FFNN$_s$ (I) | 46.38 | 27.05 | 25.45 | 27.61 |
| RNN$_s$ (I) | 46.43 | 27.24 | 25.47 | 27.56 |
| RNN$_{s+c}$ (I) | 46.51 | 27.12 | 25.55 | 27.73 |
| RNN$_u$ | 47.05 | 27.79 | 25.76 | 27.91 |
| RNN$_{u+c}$ | 46.97 | 27.76 | 25.84 | 28.20 |
Table 3 shows the translation performance by the case-sensitive BLEU4 metric [30]¹¹. Table 3 presents the average BLEU of three different MERT runs. In FBIS and NTCIR, each alignment model was trained from the randomly sampled 100 K data, and then a translation model was trained from all the training data word-aligned by that alignment model. In addition, for a detailed comparison, we evaluated the SMT system where the IBM Model 4 was trained from all the training data (IBM4$_{all}$). The significance test on translation performance was performed by the bootstrap method [19] with a 5% significance level. “*” in Table 3 indicates that the comparison is significant over both baselines, i.e., IBM4 and FFNN$_s$ (I).

¹¹ We used mteval-v13a.pl as the evaluation tool (http://www.itl.nist.gov/iad/mig/tests/mt/2009/).
Table 3 also shows that better word alignment does not always result in better translation, which has been discussed previously [40]. However, RNN$_u$ and RNN$_{u+c}$ outperform IBM4 and FFNN$_s$ (I) in all tasks. These results indicate that our proposals contribute to improving translation performance¹². In addition, Table 3 shows that these proposed models are comparable to IBM4$_{all}$ in FBIS and NTCIR, even though the proposed models are trained from only a small part of the training data.

¹² We also confirmed the effectiveness of our models on the NIST05 and NTCIR-10 evaluation data.
Figure 3 shows word alignment examples from RNN$_{u+c}$ and FFNN$_s$ (I), where solid squares indicate the gold standard alignments. Figure 3 (a) shows that RNN$_{u+c}$ adequately identifies complicated alignments with long distances compared to FFNN$_s$ (I) (e.g., the jaggy alignments of “have you been learning” in Figure 3 (a)) because RNN$_{u+c}$ captures alignment paths based on the long alignment history, which can be viewed as phrase-level alignments, while FFNN$_s$ (I) employs only the last alignment.
In French-English word alignment, the most valuable clues are located locally because English and French have similar word orders and their alignment has more one-to-one mappings than Japanese-English word alignment (Figure 3). Figure 3 (b) shows that both RNN$_{u+c}$ and FFNN$_s$ (I) work well for such simpler alignments. Therefore, the RNN-based model has less effect on French-English word alignment than on Japanese-English word alignment, as indicated in Table 2.
| Alignment | 40 K | 9 K | 1 K |
|---|---|---|---|
| IBM4 | 0.5467 | 0.4859 | 0.4128 |
| RNN$_{u+c}$ | 0.6004 | 0.5562 | 0.4842 |
| RNN$_{s+c}$ (R) | - | 0.8921 | 0.6063 |
Table 4 shows the alignment performance on BTEC with various training data sizes, i.e., the training data for IWSLT (40 K), the training data for BTEC (9 K), and 1 K data randomly sampled from the training data. Note that RNN$_{s+c}$ (R) cannot be trained from the 40 K data because the 40 K data does not have gold standard word alignments.
Table 4 demonstrates that the proposed RNN-based model outperforms IBM4 trained from the unlabeled 40 K data by employing either the 1 K labeled data or the 9 K unlabeled data, which is less than 25% of the training data for IBM4. Consequently, the SMT system using RNN$_{u+c}$ trained from a small part of the training data can achieve performance comparable to that using IBM4$_{all}$ trained from all the training data, as shown in Table 3.
| Alignment | BTEC | Hansards |
|---|---|---|
| FFNN$_s$ (I) | 0.4770 | 0.9020 |
| FFNN$_{s+c}$ (I) | 0.4854 | 0.9085 |
| FFNN$_u$ | 0.5105 | 0.9026 |
| FFNN$_{u+c}$ | 0.5313 | 0.9144 |
| FFNN$_s$ (R) | 0.8224 | - |
| FFNN$_{s+c}$ (R) | 0.8367 | - |
The proposed unsupervised learning and agreement constraint can be applied to any NN-based alignment model. Table 5 shows the alignment performance of the FFNN-based models trained by our supervised/unsupervised approaches ($s$/$u$) with and without our agreement constraint. In Table 5, “$+c$” denotes that the agreement constraint was used, and “+” indicates that the comparison with the corresponding baseline, i.e., FFNN$_s$ (I/R), is significant in the sign test with a 5% significance level.
Table 5 shows that FFNN$_{s+c}$ (R/I) and FFNN$_{u+c}$ achieve significantly better performance than FFNN$_s$ (R/I) and FFNN$_u$, respectively, in both BTEC and Hansards. In addition, FFNN$_u$ and FFNN$_{u+c}$ significantly outperform FFNN$_s$ (I) and FFNN$_{s+c}$ (I), respectively, in BTEC. The performance of these models is comparable in Hansards. These results indicate that the proposed unsupervised learning and agreement constraint benefit the FFNN-based model, similar to the RNN-based model.
We have proposed a word alignment model based on an RNN, which captures long alignment history through recurrent architectures. Furthermore, we proposed an unsupervised method for training our model using NCE and introduced an agreement constraint that encourages word embeddings to be consistent across alignment directions. Our experiments have shown that the proposed model outperforms the FFNN-based model [40] for word alignment and machine translation, and that the agreement constraint improves alignment performance.
In future work, we plan to employ contexts composed of surrounding words (e.g., $c(f_j)$ or $c(e_{a_j})$ in the FFNN-based model) in our model, even though our model implicitly encodes such contexts in the alignment history. We also plan to enrich our model with multiple hidden layers following the success of Yang et al. [40], in which multiple hidden layers improved the performance of the FFNN-based model. In addition, we would like to verify the effectiveness of the proposed method on other datasets.
We thank the anonymous reviewers for their helpful suggestions and valuable comments on the first version of this paper.