This study proposes a word alignment model based on a recurrent neural network (RNN), in which an unlimited alignment history is represented by recurrently connected hidden layers. We perform unsupervised learning using noise-contrastive estimation [15, 26], which utilizes artificially generated negative samples. Our alignment model is directional, similar to the generative IBM models [4]; to overcome this limitation, we encourage agreement between the two directional models by introducing a penalty function that ensures word embedding consistency across the two directional models during training. The RNN-based model outperforms the feed-forward neural network-based model [40] as well as the IBM Model 4 on Japanese-English and French-English word alignment tasks, and achieves comparable translation performance to those baselines on Japanese-to-English and Chinese-to-English translation tasks.
Automatic word alignment is an important task for statistical machine translation. The most classical approaches are the probabilistic IBM models 1-5 [4] and the HMM model [39]. Various studies have extended those models. Yang et al. [40] adapted the Context-Dependent Deep Neural Network for HMM (CD-DNN-HMM) [7], a type of feed-forward neural network (FFNN)-based model, to the HMM alignment model and achieved state-of-the-art performance. However, the FFNN-based model assumes a first-order Markov dependence for alignments.
Recurrent neural network (RNN)-based models have recently demonstrated state-of-the-art performance, outperforming FFNN-based models on various tasks [24, 25, 1, 16, 34]. An RNN has a hidden layer with recurrent connections that propagates its own previous signals. Through this recurrent architecture, RNN-based models have the inherent property of modeling long-span dependencies, e.g., long contexts, in input data. We assume that this property fits the word alignment task, and we propose an RNN-based word alignment model. Our model can maintain and arbitrarily integrate an alignment history, e.g., bilingual context, that is longer than the history available to the FFNN-based model.
The NN-based alignment models are supervised models. Unfortunately, it is usually difficult to prepare word-by-word aligned bilingual data. Yang et al. [40] trained their model from word alignments produced by traditional unsupervised probabilistic models. However, with this approach, errors induced by the probabilistic models are learned as correct alignments; thus, generalization capabilities are limited. To solve this problem, we apply noise-contrastive estimation (NCE) [15, 26] for unsupervised training of our RNN-based model without gold standard alignments or pseudo-oracle alignments. NCE artificially generates bilingual sentences through sampling as pseudo-negative samples, and then trains the model such that the scores of the original bilingual sentences are higher than those of the sampled bilingual sentences.
Our RNN-based alignment model has a direction, like other alignment models, i.e., from $f$ (source language) to $e$ (target language) and from $e$ to $f$. It has been shown that this limitation may be overcome by encouraging two directional models to agree by training them concurrently [22, 21, 14, 11]. The motivation is that the model and generalization errors of the two models differ, and thus the models can complement each other. Based on this motivation, our directional models are also trained simultaneously. Specifically, our training encourages word embeddings to be consistent across alignment directions by introducing into the objective function a penalty term that expresses the difference between the word embeddings of the two models. This constraint prevents each model from overfitting to a particular direction and leads to global optimization across alignment directions.
This paper presents evaluations on Japanese-English and French-English word alignment tasks and Japanese-to-English and Chinese-to-English translation tasks. The results illustrate that our RNN-based model outperforms the FFNN-based model (up to +0.0792 F1-measure) and the IBM Model 4 (up to +0.0703 F1-measure) on the word alignment tasks. For the translation tasks, our model achieves up to a 0.74% gain in BLEU compared to the FFNN-based model, which matches the translation quality of the IBM Model 4.
Various word alignment models have been proposed. These models are roughly clustered into two groups: generative models, such as those proposed by Brown et al. [4], Vogel et al. [39], and Och and Ney [28], and discriminative models, such as those proposed by Taskar et al. [36], Moore [27], and Blunsom and Cohn [3].
Given a source language sentence $f_1^J = f_1, \ldots, f_J$ and a target language sentence $e_1^I = e_1, \ldots, e_I$, $f_1^J$ is generated by $e_1^I$ via the alignment $a_1^J = a_1, \ldots, a_J$. Each $a_j$ is a hidden variable indicating that the source word $f_j$ is aligned to the target word $e_{a_j}$. Usually, a “null” word $e_0$ is added to the target language sentence and $a_1^J$ may contain $a_j = 0$, which indicates that $f_j$ is not aligned to any target word. The probability of generating the sentence $f_1^J$ from $e_1^I$ is defined as
$p(f_1^J \mid e_1^I) = \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I).$   (1)
The IBM Models 1 and 2 and the HMM model decompose it into an alignment probability and a lexical translation probability as
$p(f_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} p_a(a_j \mid a_{j-1}, j)\, p_t(f_j \mid e_{a_j}).$   (2)
The three models differ in their definition of the alignment probability $p_a$. For example, the HMM model uses an alignment probability with a first-order Markov property that depends only on the jump width: $p_a(a_j \mid a_{j-1}, j) = p_a(a_j - a_{j-1})$. In addition, the IBM Models 3-5 are extensions of these, which consider the fertility and distortion of each translated word.
These models are trained using the expectation-maximization algorithm [8] from bilingual sentences without word-level alignments (unlabeled training data). Given a specific model, the best alignment (Viterbi alignment) of the sentence pair $(f_1^J, e_1^I)$ can be found as
$\hat{a}_1^J = \operatorname*{argmax}_{a_1^J} p(f_1^J, a_1^J \mid e_1^I).$   (3)
For example, the HMM model identifies the Viterbi alignment using the Viterbi algorithm.
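For concreteness, the following is a minimal sketch (ours, not the authors' implementation) of Viterbi decoding for an HMM-style alignment model, assuming the alignment and lexical translation probabilities of Eq. 2 are given as dense arrays and ignoring the null word:

```python
import numpy as np

def viterbi_alignment(p_align, p_lex):
    """Viterbi decoding for an HMM-style alignment model (Eqs. 2 and 3).

    p_align[i_prev, i]: probability of jumping from target position i_prev to i
    p_lex[j, i]:        probability that source word j translates target word i
    Returns the most probable alignment a_1 .. a_J (one target index per source word).
    """
    J, I = p_lex.shape
    delta = np.zeros((J, I))             # best log-score of partial alignments ending at (j, i)
    back = np.zeros((J, I), dtype=int)   # backpointers for path recovery

    delta[0] = np.log(p_lex[0]) + np.log(1.0 / I)        # uniform initial alignment
    for j in range(1, J):
        # score of reaching target position i at step j from every previous position
        trans = delta[j - 1][:, None] + np.log(p_align)  # shape (I, I)
        back[j] = trans.argmax(axis=0)
        delta[j] = trans.max(axis=0) + np.log(p_lex[j])

    # backtrack the best path
    a = [int(delta[-1].argmax())]
    for j in range(J - 1, 0, -1):
        a.append(int(back[j, a[-1]]))
    return list(reversed(a))

# toy example: 3 source words, 2 target words
p_align = np.array([[0.7, 0.3],
                    [0.4, 0.6]])
p_lex = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
print(viterbi_alignment(p_align, p_lex))  # -> [0, 0, 0]
```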
As an instance of discriminative models, we describe an FFNN-based word alignment model [40], which is our baseline. An FFNN learns a hierarchy of nonlinear features that can automatically capture complex statistical patterns in input data. Recently, FFNNs have been applied successfully to several tasks, such as speech recognition [7], statistical machine translation [20, 38], and other popular natural language processing tasks [6, 5].
Yang et al. [40] have adapted a type of FFNN, i.e., CD-DNN-HMM [7], to the HMM alignment model. Specifically, the lexical translation and alignment probability in Eq. 2 are computed using FFNNs as
$s_{NN}(a_1^J \mid f_1^J, e_1^I) = \prod_{j=1}^{J} t_{lex}(f_j, e_{a_j} \mid c(f_j), c(e_{a_j})) \cdot t_a(a_j, a_{j-1} \mid c(e_{a_{j-1}})),$   (4)
where $t_a$ and $t_{lex}$ are an alignment score and a lexical translation score, respectively, $s_{NN}$ is a score of the alignments $a_1^J$, and $c(w)$ denotes a context of the word $w$. Note that the model uses nonprobabilistic scores rather than probabilities because normalization over all words is computationally expensive. The model finds the Viterbi alignment using the Viterbi algorithm, similar to the classic HMM model. Note that alignments in the FFNN-based model are also governed by first-order Markov dynamics because the alignment score depends only on the previous alignment $a_{j-1}$.
Figure 1 shows the network structure with one hidden layer for computing a lexical translation score $t_{lex}$. The model consists of a lookup layer, a hidden layer, and an output layer, each of which has a weight matrix. The model receives a source and a target word with their contexts as inputs, which are words in a predefined window (the window size is three in Figure 1). First, the lookup layer converts each input word into its word embedding by looking up its corresponding column in the embedding matrix $L$, and then concatenates the embeddings. Let $V_f$ (or $V_e$) be the set of source words (or target words) and $M$ be a predetermined embedding length; $L$ is an $M \times (|V_f| + |V_e|)$ matrix¹. Word embeddings are dense, low-dimensional, real-valued vectors that can capture syntactic and semantic properties of words [2]. The concatenation ($z_0$) is then fed to the hidden layer to capture nonlinear relations. Finally, the output layer receives the output of the hidden layer ($z_1$) and computes a lexical translation score.

¹ We add special tokens ⟨unk⟩ to handle unknown words and ⟨null⟩ to handle null alignments to $V_f$ and $V_e$.
The computations in the hidden and output layers are as follows²:

² Consecutive hidden layers can be used: $z_n = f(H_n \times z_{n-1} + B_{H_n})$. For simplicity, this paper describes the model with one hidden layer.
$z_1 = f(H \times z_0 + B_H),$   (5)
$t_{lex} = O \times z_1 + B_O,$   (6)
where $H$, $B_H$, $O$, and $B_O$ are $|z_1| \times |z_0|$, $|z_1| \times 1$, $1 \times |z_1|$, and $1 \times 1$ matrices, respectively, and $f(x)$ is an activation function. Following Yang et al. [40], a “hard” version of the hyperbolic tangent, htanh($x$)³, is used as $f(x)$ in this study.

³ htanh($x$) $= -1$ for $x < -1$, htanh($x$) $= 1$ for $x > 1$, and htanh($x$) $= x$ otherwise.
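As an illustration of Eqs. 5 and 6, the following is a minimal numpy sketch of the lexical translation scoring path (lookup, hidden, and output layers with htanh). The shapes follow the text above, but the vocabulary sizes, random initialization, and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, window, hidden = 30, 3, 100             # embedding length, context window, hidden units
V_f, V_e = 1000, 1000                      # toy vocabulary sizes (assumption)

L = rng.normal(size=(M, V_f + V_e)) * 0.1  # lookup (embedding) matrix
H = rng.normal(size=(hidden, 2 * window * M)) * 0.1
B_H = np.zeros(hidden)
O = rng.normal(size=(1, hidden)) * 0.1
B_O = np.zeros(1)

def htanh(x):
    return np.clip(x, -1.0, 1.0)           # "hard" hyperbolic tangent

def t_lex(src_window_ids, tgt_window_ids):
    """Lexical translation score for a source and a target word with their contexts."""
    # lookup layer: concatenate the embeddings of all window words (z_0)
    z0 = np.concatenate([L[:, i] for i in src_window_ids] +
                        [L[:, V_f + i] for i in tgt_window_ids])
    z1 = htanh(H @ z0 + B_H)               # hidden layer, Eq. 5
    return (O @ z1 + B_O).item()           # output layer, Eq. 6

print(t_lex([3, 17, 42], [5, 8, 99]))      # one (source, target) window pair
```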
The alignment model based on an FFNN is formed in the same manner as the lexical translation model. Each model is optimized by minimizing the following ranking loss with a margin using stochastic gradient descent (SGD)⁴, where gradients are computed by the back-propagation algorithm [31]:

⁴ In our experiments, we used a mini-batch SGD instead of a plain SGD.
$\mathrm{loss}(\theta) = \sum_{(f, e) \in T} \max\{0,\; 1 - s_\theta(a^+ \mid f, e) + s_\theta(a^- \mid f, e)\},$   (7)
where $\theta$ denotes the weights of the layers in the model, $T$ is a set of training data, $a^+$ is the gold standard alignment, $a^-$ is the incorrect alignment with the highest score under $\theta$, and $s_\theta$ denotes the score defined by Eq. 4 as computed by the model under $\theta$.
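A minimal sketch of the margin-based ranking loss of Eq. 7, assuming a scoring function `score` and a training set in which the highest-scoring incorrect alignment has already been found (e.g., by a constrained Viterbi search); the names are ours:

```python
def ranking_loss(score, training_set, theta):
    """Margin-based ranking loss of Eq. 7.

    training_set: iterable of (f, e, gold_alignment, best_incorrect_alignment) tuples
    score(a, f, e, theta): alignment score s_theta(a | f, e), e.g. Eq. 4
    """
    loss = 0.0
    for f, e, a_gold, a_wrong in training_set:
        # hinge: penalize when the gold alignment does not beat the best
        # incorrect alignment by a margin of at least 1
        loss += max(0.0, 1.0 - score(a_gold, f, e, theta) + score(a_wrong, f, e, theta))
    return loss
```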
This section proposes an RNN-based alignment model, which computes a score $s_{NN}(a_1^J \mid f_1^J, e_1^I)$ for the alignments $a_1^J$ using an RNN:
$s_{NN}(a_1^J \mid f_1^J, e_1^I) = \prod_{j=1}^{J} t_{RNN}(a_j \mid a_1^{j-1}, f_j, e_{a_j}),$   (8)
where $t_{RNN}$ is the score of an alignment $a_j$. The prediction of the $j$-th alignment $a_j$ depends on all preceding alignments $a_1^{j-1}$. Note that the proposed model also uses nonprobabilistic scores, similar to the FFNN-based model.
The RNN-based model is illustrated in Figure 2. The model consists of a lookup layer, a hidden layer, and an output layer, which have weight matrices $L$, $\{H_d, R_d, B_{H_d}\}$, and $\{O, B_O\}$, respectively. Each matrix in the hidden layer ($H_d$, $R_d$, and $B_{H_d}$) depends on the alignment, where $d$ denotes the jump distance from $a_{j-1}$ to $a_j$: $d = a_j - a_{j-1}$. In our experiments, we merge distances that are greater than 8 and less than -8 into the special “$\geq 8$” and “$\leq -8$” distances, respectively. Specifically, the hidden layer has the weight matrices $\{H_d, R_d, B_{H_d}\}$ for each $d \in \{\leq{-8}, -7, \ldots, -1, 0, 1, \ldots, 7, \geq 8\}$ and computes $y_j$ using the matrices corresponding to the jump distance $d$.
The Viterbi alignment is determined using the Viterbi algorithm, similar to the FFNN-based model, where the model is sequentially applied from $f_1$ to $f_J$⁵. When computing the score of the alignment between $f_j$ and $e_{a_j}$, the two words are input to the lookup layer. In the lookup layer, each of these words is converted to its word embedding, and then the concatenation of the two embeddings ($x_j$) is fed to the hidden layer in the same manner as in the FFNN-based model. Next, the hidden layer receives the output of the lookup layer ($x_j$) and that of the previous hidden layer ($y_{j-1}$). The hidden layer then computes and outputs the nonlinear relations between them ($y_j$). Note that the weight matrices used in this computation are those of the specific jump distance $d$. The output of the hidden layer ($y_j$) is copied and fed to the output layer and the next hidden layer. Finally, the output layer computes the score of $a_j$ ($t_{RNN}$) from the output of the hidden layer ($y_j$). Note that the FFNN-based model consists of two components: one for lexical translation and the other for alignment. The proposed RNN instead produces a single score that is constructed in the hidden layer by employing the distance-dependent weight matrices.

⁵ Strictly speaking, we cannot apply the dynamic-programming forward-backward algorithm (i.e., the Viterbi algorithm) due to the long alignment history of $y_j$. Thus, the Viterbi alignment is computed approximately using heuristic beam search.
Specifically, the computations in the hidden and output layer are as follows:
$y_j = f(H_d \times x_j + R_d \times y_{j-1} + B_{H_d}),$   (9)
$t_{RNN} = O \times y_j + B_O,$   (10)
where $H_d$, $R_d$, $B_{H_d}$, $O$, and $B_O$ are $|y_j| \times |x_j|$, $|y_j| \times |y_{j-1}|$, $|y_j| \times 1$, $1 \times |y_j|$, and $1 \times 1$ matrices, respectively. Note that $|y_j| = |y_{j-1}|$. $f(x)$ is an activation function, which is a hard hyperbolic tangent, i.e., htanh($x$), in this study.
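To make Eqs. 9 and 10 concrete, the following is a hedged numpy sketch of the recurrent scoring with distance-dependent weight matrices; the clipping of jump distances to [-8, 8] follows the text, while the initial alignment position, the random weights, and the summation of per-position scores are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, hidden = 30, 100
distances = range(-8, 9)                     # clipped jump distances d = a_j - a_{j-1}

# one (H_d, R_d, B_Hd) triple per clipped jump distance, plus a shared output layer
H = {d: rng.normal(size=(hidden, 2 * M)) * 0.1 for d in distances}
R = {d: rng.normal(size=(hidden, hidden)) * 0.1 for d in distances}
B_H = {d: np.zeros(hidden) for d in distances}
O = rng.normal(size=(1, hidden)) * 0.1
B_O = np.zeros(1)

def htanh(x):
    return np.clip(x, -1.0, 1.0)

def score_alignment(embed_pairs, alignment):
    """Per-position RNN scores t_RNN, combined over j (summed here for illustration).

    embed_pairs[j]: concatenation x_j of the embeddings of f_j and e_{a_j} (length 2M)
    alignment[j]:   target position a_j aligned to source position j
    """
    y_prev = np.zeros(hidden)
    total, a_prev = 0.0, 0                                 # assume the walk starts at position 0
    for x_j, a_j in zip(embed_pairs, alignment):
        d = int(np.clip(a_j - a_prev, -8, 8))              # jump distance, clipped
        y_j = htanh(H[d] @ x_j + R[d] @ y_prev + B_H[d])   # Eq. 9
        total += (O @ y_j + B_O).item()                    # Eq. 10
        y_prev, a_prev = y_j, a_j
    return total

x = [rng.normal(size=2 * M) for _ in range(4)]             # 4 source positions
print(score_alignment(x, [1, 2, 2, 3]))
```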
As described above, the RNN-based model has a hidden layer with recurrent connections. Under the recurrence, the proposed model compactly encodes the entire history of previous alignments in the hidden layer configuration $y_j$. Therefore, the proposed model can find alignments by taking advantage of the long alignment history, while the FFNN-based model considers only the last alignment.
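Because the score at position $j$ depends on the whole alignment prefix through the recurrent state, exact dynamic programming is not applicable (see footnote 5), and decoding falls back to heuristic beam search. A minimal sketch of such a search is given below; the `step_score` interface and hypothesis representation are our assumptions:

```python
def beam_search_alignment(J, I, step_score, beam_width=100):
    """Approximate Viterbi alignment by beam search.

    step_score(j, a_j, state) -> (score, new_state): score of aligning source
    position j to target position a_j given the alignment history encoded in
    state (e.g., the previous hidden layer output and previous alignment).
    """
    beam = [(0.0, [], None)]                    # (cumulative score, alignment, state)
    for j in range(J):
        candidates = []
        for total, alignment, state in beam:
            for a_j in range(I + 1):            # position 0 stands for the "null" word
                s, new_state = step_score(j, a_j, state)
                candidates.append((total + s, alignment + [a_j], new_state))
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:beam_width]          # keep only the best partial alignments
    return beam[0][1]                           # best full alignment found
```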
During training, we optimize the weight matrices of each layer (i.e., $L$, $H_d$, $R_d$, $B_{H_d}$, $O$, and $B_O$) following a given objective using a mini-batch SGD with batch size $D$, which converges faster than a plain SGD ($D = 1$). Gradients are computed by the back-propagation through time algorithm [31], which unfolds the network in time ($j$) and computes gradients over the time steps. In addition, an $\ell_2$ regularization term is added to the objective to prevent the model from overfitting the training data.
The RNN-based model can be trained by a supervised approach, similar to the FFNN-based model, where training proceeds based on the ranking loss defined by Eq. 7 (Section 2.2). However, this approach requires gold standard alignments. To overcome this drawback, we propose an unsupervised method using NCE, which learns from unlabeled training data.
Dyer et al. [9] presented an unsupervised alignment model based on contrastive estimation (CE) [32]. CE seeks to discriminate observed data from its neighborhood, which can be viewed as pseudo-negative samples. Dyer et al. [9] regarded all possible alignments of the bilingual sentences given as training data and those of the full translation search space as the observed data and its neighborhood, respectively.
We introduce this idea to a ranking loss with margin as
$\mathrm{loss}(\theta) = \sum_{(f, e^+) \in T} \max\bigl\{0,\; 1 - E_{\Omega(f, e^+)}[s_\theta(a \mid f, e^+)] + E_{\Omega(f, e^-)}[s_\theta(a \mid f, e^-)]\bigr\},$   (11)
where $\Omega(f, e)$ is the set of all possible alignments given $(f, e)$, $E_{\Omega}$ is the expected value of the scores $s_\theta$ on $\Omega$, $e^+$ denotes a target language sentence in the training data, and $e^-$ denotes a pseudo-target language sentence. The first expectation term is for the observed data, and the second is for the neighborhood.
However, the computation of the expectation over the neighborhood is prohibitively expensive. To reduce computation, we employ NCE, which uses sentences randomly sampled from all target language sentences in $T$ as $e^-$, and calculates the expected values by a beam search with beam width $W$ to truncate alignments with low scores. In our experiments, we set $W$ to 100. In addition, the above criterion is converted to an online fashion as
$\mathrm{loss}(\theta) = \sum_{f^+ \in T} \max\Bigl\{0,\; 1 - E_{GEN(f^+, e^+)}[s_\theta(a \mid f^+, e^+)] + \frac{1}{N} \sum_{e^-} E_{GEN(f^+, e^-)}[s_\theta(a \mid f^+, e^-)]\Bigr\},$   (12)
where $e^+$ is the target language sentence aligned to $f^+$ in the training data, i.e., $(f^+, e^+) \in T$, $e^-$ is a randomly sampled pseudo-target language sentence with length $|e^+|$, and $N$ denotes the number of pseudo-target language sentences per source sentence $f^+$. Note that $|e^-| = |e^+|$. $GEN$ is a subset of all possible word alignments $\Omega$, which is generated by beam search.
In a simple implementation, each $e^-$ is generated by repeating a random sampling from the set of target words ($V_e$) $|e^+|$ times and lining the samples up sequentially. To employ more discriminative negative samples, our implementation samples each word of $e^-$ from the set of target words that co-occur with the words of $f^+$ and whose translation probability is above a threshold under the IBM Model 1 incorporating an $\ell_0$ prior [37]. The IBM Model 1 with the $\ell_0$ prior is convenient for reducing translation candidates because it generates sparser alignments than the standard IBM Model 1.
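A hedged sketch of this sampling scheme: for each source word we keep the target words whose IBM Model 1 translation probability exceeds a threshold, and each word of a pseudo-target sentence is drawn from the union of these candidate sets. The data structures, function names, and the toy translation table are ours:

```python
import random

def build_candidates(ibm1_prob, threshold=0.001):
    """Target words whose IBM Model 1 translation probability exceeds the threshold."""
    return {f: [e for e, p in e_probs.items() if p > threshold]
            for f, e_probs in ibm1_prob.items()}

def sample_negatives(f_sentence, e_length, candidates, n_samples, target_vocab):
    """Draw n_samples pseudo-target sentences of length e_length for source sentence f."""
    pool = sorted({e for f in f_sentence for e in candidates.get(f, [])})
    if not pool:                                  # fall back to the full target vocabulary
        pool = list(target_vocab)
    return [[random.choice(pool) for _ in range(e_length)] for _ in range(n_samples)]

# toy usage with a hypothetical IBM Model 1 translation table
ibm1 = {"watashi": {"i": 0.6, "me": 0.2}, "hon": {"book": 0.7, "a": 0.01}}
cands = build_candidates(ibm1, threshold=0.05)
print(sample_negatives(["watashi", "hon"], e_length=3, candidates=cands,
                       n_samples=2, target_vocab=["i", "book", "a", "me"]))
```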
Both the FFNN-based and RNN-based models are based on the HMM alignment model, and they are therefore asymmetric, i.e., they can represent one-to-many relations from the target side. Asymmetric models are usually trained in each alignment direction. The model proposed by Yang et al. [40] is no exception. However, it has been demonstrated that encouraging directional models to agree improves alignment performance [22, 21, 14, 11].
Inspired by their work, we introduce an agreement constraint to our learning. The constraint concretely enforces agreement in word embeddings of both directions. The proposed method trains two directional models concurrently based on the following objective by incorporating a penalty term that expresses the difference between word embeddings:
$\hat{\theta}_{FE} = \operatorname*{argmin}_{\theta_{FE}} \bigl\{ \mathrm{loss}(\theta_{FE}) + \alpha \,\| \theta^L_{EF} - \theta^L_{FE} \| \bigr\},$   (13)
$\hat{\theta}_{EF} = \operatorname*{argmin}_{\theta_{EF}} \bigl\{ \mathrm{loss}(\theta_{EF}) + \alpha \,\| \theta^L_{FE} - \theta^L_{EF} \| \bigr\},$   (14)
where $\theta_{FE}$ (or $\theta_{EF}$) denotes the weights of the layers in the source-to-target (or target-to-source) alignment model, $\theta^L$ denotes the weights of the lookup layer, i.e., the word embeddings, and $\alpha$ is a parameter that controls the strength of the agreement constraint. $\|\theta\|$ indicates the norm of $\theta$; the 2-norm is used in our experiments. Equations 13 and 14 can be applied to both the supervised and unsupervised approaches: Equations 7 and 12 are substituted into $\mathrm{loss}(\theta)$ in supervised and unsupervised learning, respectively. The proposed constraint penalizes overfitting to a particular direction and enables the two directional models to be optimized globally across alignment directions.
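A minimal sketch of the penalized objectives of Eqs. 13 and 14, assuming the two directional losses are already computed and the lookup-layer weights are available as numpy arrays; only α and the 2-norm follow the text, the rest is illustrative:

```python
import numpy as np

def penalized_objectives(loss_fe, loss_ef, L_fe, L_ef, alpha=0.1):
    """Directional losses with the word-embedding agreement penalty (Eqs. 13 and 14).

    loss_fe, loss_ef: losses of the source-to-target / target-to-source models
                      (Eq. 7 for supervised, Eq. 12 for unsupervised learning)
    L_fe, L_ef:       lookup-layer weights (word embeddings) of the two models
    alpha:            strength of the agreement constraint
    """
    penalty = alpha * np.linalg.norm(L_ef - L_fe)   # 2-norm of the embedding difference
    return loss_fe + penalty, loss_ef + penalty     # minimized w.r.t. each direction

# toy usage
L_fe = np.ones((30, 10))
L_ef = np.full((30, 10), 1.1)
print(penalized_objectives(2.3, 2.7, L_fe, L_ef))
```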
Algorithm 1 Training Algorithm
Input: $\theta_{FE}$, $\theta_{EF}$, training data $T$, translation candidates found by the IBM Model 1 with the $\ell_0$ prior,
batch size $D$, $N$, $MaxIter$, $\alpha$
1: for all $t$ such that $1 \leq t \leq MaxIter$ do
2:   $\{(f^+, e^+)\}_D \leftarrow$ sample($T$, $D$)
3-1:   $\{e^-\} \leftarrow$ neg_sample($f^+$, $N$)
3-2:   $\{f^-\} \leftarrow$ neg_sample($e^+$, $N$)
4-1:   $\theta_{FE} \leftarrow$ update($\theta_{FE}$, $\theta^L_{EF}$, $f^+$, $e^+$, $e^-$)
4-2:   $\theta_{EF} \leftarrow$ update($\theta_{EF}$, $\theta^L_{FE}$, $e^+$, $f^+$, $f^-$)
5: end for
Output: $\theta_{FE}$, $\theta_{EF}$
Our unsupervised learning procedure is summarized in Algorithm 1. In Algorithm 1, line 2 randomly samples $D$ bilingual sentences $(f^+, e^+)$ from the training data $T$. Lines 3-1 and 3-2 generate $N$ pseudo-negative samples for each $f^+$ and $e^+$ based on the translation candidates of $f^+$ and $e^+$ found by the IBM Model 1 with the $\ell_0$ prior (Section 4.1). Lines 4-1 and 4-2 update the weights in each layer following a given objective (Sections 4.1 and 4.2). Note that $\theta_{FE}$ and $\theta_{EF}$ are updated concurrently in each iteration, and $\theta^L_{EF}$ (or $\theta^L_{FE}$) is employed to enforce agreement between word embeddings when updating $\theta_{FE}$ (or $\theta_{EF}$).
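A hedged Python rendering of Algorithm 1, assuming helper functions `sample`, `neg_sample`, and `update` with the roles described above; the signatures and the per-pair update inside the mini-batch are our simplifications:

```python
def train(theta_fe, theta_ef, data, tc_e, tc_f,
          batch_size, n_neg, max_iter, sample, neg_sample, update):
    """Concurrent training of the two directional models (cf. Algorithm 1).

    tc_e, tc_f: translation candidates per source/target word from the
                IBM Model 1 with the l0 prior (used for negative sampling)
    """
    for _ in range(max_iter):
        batch = sample(data, batch_size)                           # line 2: D pairs (f+, e+)
        for f_plus, e_plus in batch:
            e_neg = neg_sample(f_plus, len(e_plus), tc_e, n_neg)   # line 3-1
            f_neg = neg_sample(e_plus, len(f_plus), tc_f, n_neg)   # line 3-2
            # lines 4-1 / 4-2: each direction is updated while the other model's
            # embeddings are passed in so the agreement penalty can be evaluated
            theta_fe = update(theta_fe, theta_ef, f_plus, e_plus, e_neg)
            theta_ef = update(theta_ef, theta_fe, e_plus, f_plus, f_neg)
    return theta_fe, theta_ef
```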
| Corpus | Train | Dev | Test |
|---|---|---|---|
| BTEC | 9 K | 0 | 960 |
| Hansards | 1.1 M | 37 | 447 |
| FBIS | 240 K | 878 | 919 / 1,597 |
| IWSLT | 40 K | 2,501 | 489 |
| NTCIR | 3.2 M | 2,000 | 2,000 |
We evaluated the alignment performance of the proposed models on two tasks: Japanese-English word alignment with the Basic Travel Expression Corpus (BTEC) [35] and French-English word alignment with the Hansards dataset (Hansards) from the 2003 NAACL shared task [23]. In addition, we evaluated the end-to-end translation performance of three tasks: a Chinese-to-English translation task with the FBIS corpus (FBIS), the IWSLT 2007 Japanese-to-English translation task (IWSLT) [10], and the NTCIR-9 Japanese-to-English patent translation task (NTCIR) [13]⁶.

⁶ We did not evaluate the translation performance on the Hansards data because the development data is very small and performance would be unreliable.
Table 1 shows the sizes of our experimental datasets. Note that the development data was not used in the alignment tasks, i.e., BTEC and Hansards, because the hyperparameters of the alignment models were set by preliminary small-scale experiments. The BTEC data is the first 9,960 sentence pairs in the training data for IWSLT, which were annotated with word alignments [12]. We split these pairs into the first 9,000 as training data and the remaining 960 as test data. All the data in BTEC is word-aligned, and the training data in Hansards is unlabeled data. In FBIS, we used the NIST02 evaluation data as the development data, and the NIST03 and NIST04 evaluation data as test data (NIST03 and NIST04).
We evaluated the proposed RNN-based alignment models against two baselines: the IBM Model 4 and the FFNN-based model with one hidden layer. The IBM Model 4 was trained with the previously presented model sequence scheme [28]: $1^5 H^5 3^3 4^3$, i.e., five iterations of the IBM Model 1 followed by five iterations of the HMM model, etc., which is the default setting for GIZA++ (IBM4). For the FFNN-based model, we set the word embedding length $M$ to 30, the number of units of the hidden layer to 100, and the window size of contexts to 5. Hence, $|z_0|$ is 300 ($30 \times 5 \times 2$). Following Yang et al. [40], the FFNN-based model was trained by the supervised approach described in Section 2.2 (FFNN$_s$).
For the RNN-based models, we set $M$ to 30 and the number of units of each recurrent hidden layer to 100. Thus, $|x_j|$ is 60 ($30 \times 2$). The number of units of each layer of the FFNN-based and RNN-based models and $M$ were set through preliminary experiments. To demonstrate the effectiveness of the proposed learning methods, we evaluated four types of RNN-based models: RNN$_s$, RNN$_{s+c}$, RNN$_u$, and RNN$_{u+c}$, where the subscripts $s$ and $u$ denote a supervised and an unsupervised model, respectively, and “$+c$” indicates that the agreement constraint was used.
In training all the models except IBM4, the weights of each layer were initialized first. For the weights of the lookup layer $L$, we preliminarily trained word embeddings for the source and target language from each side of the training data, and then set $L$ to these pretrained word embeddings to avoid falling into local minima. The other weights were randomly initialized. For the pretraining, we used the RNNLM Toolkit⁷ [24] with the default options. We mapped all words that occurred less than five times to the special token ⟨unk⟩. Next, each weight was optimized using the mini-batch SGD, where the batch size $D$ was 100, the learning rate was 0.01, and the $\ell_2$ regularization parameter was 0.1. The training stopped after 50 epochs. The other parameters were set as follows: $W$, $N$, and the translation candidate threshold in the unsupervised learning were 100, 50, and 0.001, respectively, and $\alpha$ for the agreement constraint was 0.1.

⁷ http://www.fit.vutbr.cz/~imikolov/rnnlm/
In the translation tasks, we used the Moses phrase-based SMT systems [17]. All Japanese and Chinese sentences were segmented by ChaSen⁸ and the Stanford Chinese segmenter⁹, respectively. In the training, long sentences with over 40 words were filtered out. Using the SRILM Toolkit [33] with modified Kneser-Ney smoothing, we trained a 5-gram language model on the English side of each training dataset for IWSLT and NTCIR, and a 5-gram language model on the Xinhua portion of the English Gigaword corpus for FBIS. The SMT weighting parameters were tuned by MERT [29] on the development data.

⁸ http://chasen-legacy.sourceforge.jp/
⁹ http://nlp.stanford.edu/software/segmenter.shtml
| Alignment | BTEC | Hansards |
|---|---|---|
| IBM4 | 0.4859 | 0.9029 |
| FFNN$_s$ (I) | 0.4770 | 0.9020 |
| RNN$_s$ (I) | 0.5053 | 0.9068 |
| RNN$_{s+c}$ (I) | 0.5174 | 0.9202 |
| RNN$_u$ | 0.5307 | 0.9037 |
| RNN$_{u+c}$ | 0.5562 | 0.9275 |
| FFNN$_s$ (R) | 0.8224 | - |
| RNN$_s$ (R) | 0.8798 | - |
| RNN$_{s+c}$ (R) | 0.8921 | - |
Table 2 shows the alignment performance by the F1-measure. Hereafter, (R) and (I) denote models trained from gold standard alignments and from word alignments found by the IBM Model 4, respectively. In Hansards, all models were trained from 100 K data randomly sampled from the training data¹⁰. We evaluated the word alignments produced by first applying each model in both directions and then combining the alignments using the “grow-diag-final-and” heuristic [18]. The significance test on word alignment performance was performed by the sign test with a 5% significance level. “+” in Table 2 indicates that the comparison is significant over the corresponding baselines, IBM4 and FFNN$_s$ (I).

¹⁰ Due to high computational cost, we did not use all the training data. Scaling up to larger datasets will be addressed in future work.
In Table 2, RNN$_{u+c}$, which includes all our proposals, i.e., the RNN-based model, the unsupervised learning, and the agreement constraint, achieves the best performance for both BTEC and Hansards. The differences from the baselines are statistically significant.
Table 2 shows that RNN$_s$ outperforms FFNN$_s$, and the difference is statistically significant in BTEC. These results demonstrate that capturing the long alignment history in the RNN-based model improves the alignment performance. We discuss the difference in the RNN-based model’s effectiveness between language pairs in Section 6.1. Table 2 also shows that RNN$_{s+c}$ and RNN$_{u+c}$ achieve significantly better performance than RNN$_s$ and RNN$_u$, respectively, in both tasks. This indicates that the proposed agreement constraint is effective for training better models in both the supervised and unsupervised approaches.
In BTEC, RNN$_u$ and RNN$_{u+c}$ significantly outperform RNN$_s$ (I) and RNN$_{s+c}$ (I), respectively, and the performance of these models is comparable in Hansards. This indicates that our unsupervised learning benefits our models because the supervised models are adversely affected by errors in the automatically generated training data. This is especially true when the quality of the training data, i.e., the performance of IBM4, is low.
| Alignment | IWSLT | NTCIR | NIST03 | NIST04 |
|---|---|---|---|---|
| IBM4$_{all}$ | 46.47 | 27.91 | 25.90 | 28.34 |
| IBM4 | - | 27.25 | 25.41 | 27.65 |
| FFNN$_s$ (I) | 46.38 | 27.05 | 25.45 | 27.61 |
| RNN$_s$ (I) | 46.43 | 27.24 | 25.47 | 27.56 |
| RNN$_{s+c}$ (I) | 46.51 | 27.12 | 25.55 | 27.73 |
| RNN$_u$ | 47.05 | 27.79 | 25.76 | 27.91 |
| RNN$_{u+c}$ | 46.97 | 27.76 | 25.84 | 28.20 |
Table 3 shows the translation performance by the case-sensitive BLEU4 metric [30]¹¹. Table 3 presents the average BLEU of three different MERT runs. In FBIS and NTCIR, each alignment model was trained from the randomly sampled 100 K data, and then a translation model was trained from all the training data word-aligned by that alignment model. In addition, for a detailed comparison, we evaluated the SMT system where the IBM Model 4 was trained from all the training data (IBM4$_{all}$). The significance test on translation performance was performed by the bootstrap method [19] with a 5% significance level. “*” in Table 3 indicates that the comparison is significant over both baselines, i.e., IBM4 and FFNN$_s$ (I).

¹¹ We used mteval-v13a.pl as the evaluation tool (http://www.itl.nist.gov/iad/mig/tests/mt/2009/).
Table 3 also shows that better word alignment does not always result in better translation, which has been discussed previously [40]. However, RNN$_u$ and RNN$_{u+c}$ outperform IBM4 and FFNN$_s$ (I) in all tasks. These results indicate that our proposals contribute to improving translation performance¹². In addition, Table 3 shows that these proposed models are comparable to IBM4$_{all}$ in FBIS and NTCIR, even though the proposed models are trained from only a small part of the training data.

¹² We also confirmed the effectiveness of our models on the NIST05 and NTCIR-10 evaluation data.
Figure 3 shows word alignment examples from RNN$_{u+c}$ and FFNN$_s$ (I), where solid squares indicate the gold standard alignments. Figure 3 (a) shows that RNN$_{u+c}$ adequately identifies complicated alignments with long distances compared to FFNN$_s$ (I) (e.g., the jaggy alignments of “have you been learning” in Figure 3 (a)) because RNN$_{u+c}$ captures alignment paths based on the long alignment history, which can be viewed as phrase-level alignments, while FFNN$_s$ (I) employs only the last alignment.
In French-English word alignment, the most valuable clues are located locally because English and French have similar word orders and their alignment has more one-to-one mappings than Japanese-English word alignment (Figure 3). Figure 3 (b) shows that both RNN$_{u+c}$ and FFNN$_s$ (I) work well for such simpler alignments. Therefore, the RNN-based model has less effect on French-English word alignment than on Japanese-English word alignment, as indicated in Table 2.
| Alignment | 40 K | 9 K | 1 K |
|---|---|---|---|
| IBM4 | 0.5467 | 0.4859 | 0.4128 |
| RNN$_{u+c}$ | 0.6004 | 0.5562 | 0.4842 |
| RNN$_{s+c}$ (R) | - | 0.8921 | 0.6063 |
Table 4 shows the alignment performance on BTEC with various training data sizes, i.e., the training data for IWSLT (40 K), the training data for BTEC (9 K), and 1 K data randomly sampled from the training data. Note that RNN$_{s+c}$ (R) cannot be trained from the 40 K data because the 40 K data does not have gold standard word alignments.
Table 4 demonstrates that the proposed RNN-based model outperforms IBM4 trained from the unlabeled 40 K data by employing either the 1 K labeled data or the 9 K unlabeled data, which is less than 25% of the training data for IBM4. Consequently, the SMT system using RNN$_{u+c}$ trained from a small part of the training data can achieve performance comparable to that using IBM4$_{all}$ trained from all the training data, as shown in Table 3.
| Alignment | BTEC | Hansards |
|---|---|---|
| FFNN$_s$ (I) | 0.4770 | 0.9020 |
| FFNN$_{s+c}$ (I) | 0.4854 | 0.9085 |
| FFNN$_u$ | 0.5105 | 0.9026 |
| FFNN$_{u+c}$ | 0.5313 | 0.9144 |
| FFNN$_s$ (R) | 0.8224 | - |
| FFNN$_{s+c}$ (R) | 0.8367 | - |
The proposed unsupervised learning and agreement constraint can be applied to any NN-based alignment model. Table 5 shows the alignment performance of the FFNN-based models trained by our supervised/unsupervised approaches ($s$/$u$) with and without our agreement constraint. In Table 5, “$+c$” denotes that the agreement constraint was used, and “+” indicates that the comparison with the corresponding baseline, i.e., FFNN$_s$ (I/R), is significant in the sign test with a 5% significance level.
Table 5 shows that FFNN$_{s+c}$ (R/I) and FFNN$_{u+c}$ achieve significantly better performance than FFNN$_s$ (R/I) and FFNN$_u$, respectively, in both BTEC and Hansards. In addition, FFNN$_u$ and FFNN$_{u+c}$ significantly outperform FFNN$_s$ (I) and FFNN$_{s+c}$ (I), respectively, in BTEC. The performance of these models is comparable in Hansards. These results indicate that the proposed unsupervised learning and agreement constraint benefit the FFNN-based model, similar to the RNN-based model.
We have proposed a word alignment model based on an RNN, which captures long alignment history through recurrent architectures. Furthermore, we proposed an unsupervised method for training our model using NCE and introduced an agreement constraint that encourages word embeddings to be consistent across alignment directions. Our experiments have shown that the proposed model outperforms the FFNN-based model [40] for word alignment and machine translation, and that the agreement constraint improves alignment performance.
In future work, we plan to employ contexts composed of surrounding words (e.g., $c(f_j)$ or $c(e_{a_j})$ in the FFNN-based model) in our model, even though our model implicitly encodes such contexts in the alignment history. We also plan to enrich our model with multiple hidden layers following the success of Yang et al. [40], in which multiple hidden layers improved the performance of the FFNN-based model. In addition, we would like to verify the effectiveness of the proposed method on other datasets.
We thank the anonymous reviewers for their helpful suggestions and valuable comments on the first version of this paper.