Statistical Machine Translation (SMT) usually utilizes contextual information to disambiguate translation candidates. However, the contexts it exploits are often limited to sentence boundaries, so broader topical information cannot be leveraged. In this paper, we propose a novel approach to learning topic representation for parallel data using a neural network architecture, where abundant topical contexts are embedded via topic-relevant monolingual data. By associating each translation rule with the topic representation, topic-relevant rules are selected according to the distributional similarity with the source text during SMT decoding. Experimental results show that our method significantly improves translation accuracy in the NIST Chinese-to-English translation task compared to a state-of-the-art baseline.
Lei Cui¹, Dongdong Zhang², Shujie Liu², Qiming Chen³, Mu Li², Ming Zhou², Muyun Yang¹
¹ School of Computer Science and Technology, Harbin Institute of Technology, Harbin, P.R. China
leicui@hit.edu.cn, ymy@mtlab.hit.edu.cn
² Microsoft Research, Beijing, P.R. China
{dozhang,shujliu,muli,mingzhou}@microsoft.com
³ Shanghai Jiao Tong University, Shanghai, P.R. China
simoncqm@gmail.com
Making translation decisions is a difficult task in many Statistical Machine Translation (SMT) systems. Current translation modeling approaches usually use context-dependent information to disambiguate translation candidates. For example, translation sense disambiguation approaches [4, 5] have been proposed for phrase-based SMT systems. Meanwhile, for hierarchical phrase-based or syntax-based SMT systems, there is also much work on using rich contexts to guide rule selection [15, 21, 23, 35]. Although these methods are effective and have proven successful in many SMT systems, they only leverage within-sentence contexts, which are insufficient for exploring broader information. For example, the word driver often means “the operator of a motor vehicle” in common texts. But in the sentence “Finally, we write the user response to the buffer, i.e., pass it to our driver”, we understand that driver means “computer program”. In this case, people understand the meaning because of the IT topical context, which goes beyond sentence-level analysis and requires more relevant knowledge. Therefore, it is important to leverage topic information to learn smarter translation models and achieve better translation performance.
Topic modeling is a useful mechanism for discovering and characterizing various semantic concepts embedded in a collection of documents. Attempts at topic-based translation modeling include topic-specific lexicon translation models [37, 38], topic similarity models for synchronous rules [34], and document-level translation with topic coherence [36]. In addition, topic-based approaches have been used in domain adaptation for SMT [31, 30], where different topics are viewed as different domains. One property these approaches have in common is that they only utilize parallel data where document boundaries are explicitly given. In this way, the topic of a sentence can be inferred with document-level information using off-the-shelf topic modeling toolkits such as Latent Dirichlet Allocation (LDA) [3] or the Hidden Topic Markov Model (HTMM) [14]. Most of them also assume that the input is given at the document level. However, this assumption does not always hold, since a considerable amount of parallel data lacks document boundaries. In addition, contemporary SMT systems often work at the sentence level rather than the document level for efficiency reasons. Although we can easily apply LDA at the sentence level, it is quite difficult to infer the topic accurately from only a few words in a sentence. This makes previous approaches impractical for real-world commercial SMT systems. Therefore, we need a systematic approach to enriching the sentence and inferring its topic more accurately.
In this paper, we propose a novel approach to learning topic representations for sentences. Since the information within a sentence is insufficient for topic modeling, we first enrich sentence contexts via Information Retrieval (IR) methods, using the content words in the sentence as queries so that topic-related monolingual documents can be collected. These topic-related documents are then utilized to learn a specific topic representation for each sentence with a neural network based approach. Neural networks are an effective technique for learning different levels of data representation, where high-level representations are obtained from the low-level bag-of-words input and the levels correspond to distinct levels of concepts. They can detect correlations among any subset of input features through non-linear transformations, which helps eliminate the effect of noisy words that are irrelevant to the topic. Our problem fits well into this framework, and we expect it to improve the inference of topic representations for sentences.
To incorporate topic representations as translation knowledge into SMT, our neural network based approach directly optimizes similarities between the source language and target language in a compact topic space. This underlying topic space is learned from sentence-level parallel data in order to share topic information across the source and target languages as much as possible. Additionally, our model can be discriminatively trained with a large number of training instances, without expensive sampling methods such as those used in LDA or HTMM, making it more practical and scalable. Finally, we associate the learned representations with bilingual translation rules. Topic-related rules are selected according to their distributional similarity with the source text, which helps hypothesis generation in SMT decoding. We integrate topic similarity features into the log-linear model and evaluate the performance on the NIST Chinese-to-English translation task. Experimental results demonstrate that our model significantly improves translation accuracy over a state-of-the-art baseline.
Deep learning has been an active research topic in recent years and has succeeded in many machine learning areas. The technique began attracting wide attention in the mid-2000s after researchers showed how a multi-layer feed-forward neural network can be effectively trained. The training procedure often involves two phases: a layer-wise unsupervised pre-training phase and a supervised fine-tuning phase. For pre-training, the Restricted Boltzmann Machine (RBM) [16], auto-encoding [1] and sparse coding [20] are most frequently used. Unsupervised pre-training trains the network one layer at a time and helps guide the parameters of each layer towards better regions in parameter space [2]. Followed by fine-tuning in this parameter region, deep learning is able to achieve state-of-the-art performance in various research areas, including breakthrough results on the ImageNet dataset for object recognition [19] and significant error reductions in speech recognition [10].
Deep learning has also been successfully applied to a variety of NLP tasks such as part-of-speech tagging, chunking, named entity recognition, semantic role labeling [8], parsing [28], and sentiment analysis [29]. Most of this research converts a high-dimensional, sparse binary representation into a low-dimensional, real-valued representation. This low-dimensional representation is usually learned from huge amounts of monolingual text in the pre-training phase, and then fine-tuned towards a task-specific criterion. Inspired by this previous work, we first learn sentence representations using topic-related monolingual texts in the pre-training phase, and then optimize the bilingual similarity by leveraging sentence-level parallel data in the fine-tuning phase.
In this section, we explain our neural network based topic similarity model in detail, as well as how to incorporate the topic similarity features into the SMT decoding procedure. Figure 1 sketches the high-level overview, illustrating how topic representations are learned using sentence-level parallel data. Given a parallel sentence pair $(f, e)$, the first step is to treat $f$ and $e$ as queries and use IR methods to retrieve relevant documents that enrich their contextual information. Specifically, the ranking model we use is a Vector Space Model (VSM), where the query and documents are converted into tf-idf weighted vectors. The $N$ most relevant documents are retrieved and converted to high-dimensional, bag-of-words inputs $\mathbf{f}$ and $\mathbf{e}$ for representation learning (we use $\mathbf{f}$ and $\mathbf{e}$ to denote the bag-of-words vectors converted from the retrieved documents).
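To make the retrieval step concrete, the snippet below is a minimal sketch of tf-idf based VSM retrieval. The use of scikit-learn and the function names are assumptions made only for illustration; the actual experiments index the monolingual documents with a Lucene inverted index (Section 4), so this is just an approximation of that pipeline.

```python
# Hypothetical sketch of the context-enrichment step: each sentence is used as
# a query against a monolingual document collection, and the N most relevant
# documents are returned under a tf-idf vector space model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(documents):
    """Fit a tf-idf model over the monolingual document collection."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)  # shape: (num_docs, vocab)
    return vectorizer, doc_matrix

def retrieve_top_n(sentence, vectorizer, doc_matrix, n=10):
    """Return the indices of the N documents most similar to the query sentence."""
    query_vec = vectorizer.transform([sentence])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    return np.argsort(-scores)[:n]
```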
There are two phases in our neural network training process: pre-training and fine-tuning. In the pre-training phase (Section 3.1), we build two neural networks with the same structure but different parameters to learn a low-dimensional representation for sentences in the two languages. Then, in the fine-tuning phase (Section 3.2), our model directly optimizes the similarity of the two low-dimensional representations, so that it correlates well with SMT decoding. Finally, the learned representations are used to calculate similarities, which are integrated as features into the SMT decoding procedure (Section 3.3).
In the pre-training phase, we leverage neural network structures to transform high-dimensional sparse vectors into low-dimensional dense vectors. The topic similarity is calculated on top of the learned dense vectors. This dense representation should preserve the information in the bag-of-words input while alleviating the data sparseness problem. Therefore, we use a specially designed mechanism, the auto-encoder, to solve this problem. The auto-encoder [1] is one of the basic building blocks of deep learning. Assuming that the input is a $K$-of-$V$ binary vector $\mathbf{x}$ representing the bag-of-words ($V$ is the vocabulary size), an auto-encoder consists of an encoding process that maps $\mathbf{x}$ to a low-dimensional vector $\mathbf{z}$ and a decoding process that reconstructs $\hat{\mathbf{x}}$ from $\mathbf{z}$. The objective of the auto-encoder is to minimize the reconstruction error $\mathcal{L}(\mathbf{x}, \hat{\mathbf{x}})$. Our goal is to learn a low-dimensional vector $\mathbf{z}$ that preserves the information in the original $K$-of-$V$ vector.
One problem with the auto-encoder is that it treats all words in the same way, making no distinction between function words and content words. The representation learned by auto-encoders therefore tends to be influenced by function words and is not robust. To alleviate this problem, Vincent et al. (2008) proposed the Denoising Auto-Encoder (DAE), which aims to reconstruct a clean, “repaired” input from a corrupted, partially destroyed vector. This is done by corrupting the initial input $\mathbf{x}$ to obtain a partially destroyed version $\tilde{\mathbf{x}}$. The DAE is capable of capturing the global structure of the input while ignoring the noise. In our task, for each sentence, we treat the retrieved relevant documents as a single large document and convert it to a bag-of-words vector $\mathbf{x}$, as shown in Figure 2. With the DAE, the input $\mathbf{x}$ is corrupted by applying masking noise (randomly setting 1s to 0s), yielding $\tilde{\mathbf{x}}$. Denoising training can be viewed as “filling in the blanks” [33]: the masked components can be recovered from the non-corrupted components. For example, in IT-related texts, if the word driver is masked, it should be predicted through the hidden units of the neural network from active signals such as “buffer”, “user response”, etc.
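A small sketch of how the DAE input could be prepared is given below: the retrieved documents are merged into one bag-of-words vector, which is then corrupted with masking noise. The tokenization, vocabulary handling, and corruption rate are assumptions of this sketch, not details from the paper's implementation.

```python
import numpy as np

def bag_of_words(tokens, vocab):
    """Build a K-of-V binary vector over a fixed vocabulary (V = len(vocab))."""
    x = np.zeros(len(vocab), dtype=np.float32)
    for tok in tokens:
        if tok in vocab:
            x[vocab[tok]] = 1.0
    return x

def mask_corrupt(x, rng, corruption_rate=0.3):
    """Apply masking noise: randomly set a fraction of the active entries to 0."""
    x_tilde = x.copy()
    active = np.flatnonzero(x_tilde)
    n_mask = int(corruption_rate * len(active))
    if n_mask > 0:
        masked = rng.choice(active, size=n_mask, replace=False)
        x_tilde[masked] = 0.0
    return x_tilde

# Example usage with a toy vocabulary.
vocab = {"buffer": 0, "driver": 1, "user": 2, "response": 3}
x = bag_of_words("we write the user response to the buffer".split(), vocab)
x_tilde = mask_corrupt(x, np.random.default_rng(0))
```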
In our case, the encoding process transforms the corrupted input $\tilde{\mathbf{x}}$ into $\mathbf{z}$ with two layers: a linear layer followed by a non-linear layer. Assuming that the dimension of $\mathbf{z}$ is $L$, the linear layer forms an $L \times V$ matrix $W$ which projects the $K$-of-$V$ vector to an $L$-dimensional hidden layer. After the bag-of-words input has been transformed, it is fed into a subsequent layer to model the highly non-linear relations among words:

$\mathbf{z} = g(W\tilde{\mathbf{x}} + \mathbf{b})$  (1)
where $\mathbf{z}$ is the output of the non-linear layer, $\mathbf{b}$ is an $L$-dimensional bias vector, and $g(\cdot)$ is a non-linear function, for which common choices include the sigmoid function, the hyperbolic tangent, the “hard” hyperbolic tangent, the rectifier function, etc. In this work, we use the rectifier as our non-linear function due to its efficiency and better performance [13]:

$g(x) = \max(0, x)$  (2)
The decoding process consists of a linear layer and a non-linear layer with a similar network structure but different parameters. It transforms the $L$-dimensional vector $\mathbf{z}$ back to a $V$-dimensional vector $\hat{\mathbf{x}}$. To minimize the reconstruction error with respect to the uncorrupted input $\mathbf{x}$, we define the loss function as the L2-norm of the difference between the uncorrupted input and the reconstructed input:

$\mathcal{L}(\mathbf{x}, \hat{\mathbf{x}}) = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2$  (3)
Multi-layer neural networks are trained with the standard back-propagation algorithm [26]: the gradient of the loss function is calculated and back-propagated to the previous layers to update their parameters. Training neural networks involves many factors, such as the learning rate and the length $L$ of the hidden layer. We discuss the optimization of these parameters in Section 4.
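The following is a compact, hedged sketch of the pre-training step defined by Equations (1)-(3): a V-to-L linear layer with a rectifier for encoding, a mirrored decoder, and an L2 reconstruction loss against the uncorrupted input, trained by back-propagation. PyTorch is assumed here purely for convenience of automatic differentiation; the paper's own distributed implementation is described in Section 4.

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Two-layer encoder/decoder corresponding to Equations (1)-(3)."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden_size), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden_size, vocab_size), nn.ReLU())

    def forward(self, x_tilde):
        z = self.encoder(x_tilde)   # Equation (1) with the rectifier of Equation (2)
        x_hat = self.decoder(z)     # reconstruction from the hidden representation
        return z, x_hat

def pretrain_step(model, optimizer, x, x_tilde):
    """One back-propagation step on the L2 reconstruction loss of Equation (3)."""
    optimizer.zero_grad()
    _, x_hat = model(x_tilde)
    loss = ((x - x_hat) ** 2).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```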
In the fine-tuning phase, we stack another layer on top of the two low-dimensional vectors to maximize the similarity between the source and target languages. The similarity scores are integrated into the standard log-linear model for making translation decisions. Since the vectors from the DAEs are trained independently on monolingual data, they may be inadequate for measuring bilingual topic similarity because the two topic spaces differ. Therefore, in this stage, parallel sentence pairs are used to connect the vectors from the two languages, since the two sides of a pair express the same topic. In effect, the objective of fine-tuning is to discover a latent topic space that is shared by both languages as much as possible. This shared topic space is particularly useful when the SMT decoder tries to match source texts with translation candidates in the target language.
Given a parallel sentence pair $(f, e)$, the DAEs learn representations for $f$ and $e$ respectively, denoted as $\mathbf{z}_f$ and $\mathbf{z}_e$ in Figure 1. We then take the two vectors as input and calculate their similarity, so that the whole neural network can be fine-tuned towards a supervised criterion with the help of parallel data. The similarity score of the representation pair is defined as the cosine similarity of the two vectors:

$\mathrm{sim}(\mathbf{z}_f, \mathbf{z}_e) = \frac{\mathbf{z}_f \cdot \mathbf{z}_e}{\lVert \mathbf{z}_f \rVert \, \lVert \mathbf{z}_e \rVert}$  (4)
Since a parallel sentence pair should share the same topic, our goal is to maximize the similarity score between the source sentence and the target sentence. Inspired by the contrastive estimation method [27], for each parallel sentence pair $(f, e)$ serving as a positive instance, we select the target sentence $e^-$ of another pair from the training data and treat $(f, e^-)$ as a negative instance. To make the similarity of the positive instance larger than that of the negative instance by some margin $\delta$, we utilize the following pairwise ranking loss:

$\mathrm{loss}(f, e, e^-) = \max\{0, \, \delta - d\}$  (5)

where $d = \mathrm{sim}(\mathbf{z}_f, \mathbf{z}_e) - \mathrm{sim}(\mathbf{z}_f, \mathbf{z}_{e^-})$. The rationale behind this criterion is that the smaller $d$ is, the more we should penalize the negative instance.
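A hedged sketch of this fine-tuning criterion is shown below: the cosine similarity of Equation (4) and the max-margin ranking loss of Equation (5). The margin value is a placeholder, and the sketch follows the same PyTorch convention as the pre-training example.

```python
import torch
import torch.nn.functional as F

def topic_similarity(z_f, z_e):
    """Equation (4): cosine similarity between two topic representations."""
    return F.cosine_similarity(z_f, z_e, dim=-1)

def ranking_loss(z_f, z_e, z_e_neg, delta=0.1):
    """Equation (5): penalize the pair when the margin d falls below delta."""
    d = topic_similarity(z_f, z_e) - topic_similarity(z_f, z_e_neg)
    return torch.clamp(delta - d, min=0.0).mean()
```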
To effectively train the model for this task, negative instances must be selected carefully. Since different sentences may have very similar topic distributions, we select negative instances that are dissimilar to the positive instances based on the following criteria (a small filtering sketch follows the list):
For each positive instance $(f, e)$, we select an $e^-$ that contains at least 30% different content words from $e$.
If no such $e^-$ can be found, we remove $(f, e)$ from the training instances for network learning.
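The filter below illustrates the negative-instance criterion: a candidate target sentence qualifies only if at least 30% of its content words differ from those of the positive target sentence. The tokenization and stop-word handling are assumptions of this sketch.

```python
def content_words(sentence, stop_words):
    """Content words approximated as tokens not in a stop-word list (assumption)."""
    return {w for w in sentence.split() if w not in stop_words}

def is_valid_negative(e_pos, e_neg, stop_words, min_diff=0.3):
    """True if at least 30% of e_neg's content words do not occur in e_pos."""
    pos = content_words(e_pos, stop_words)
    neg = content_words(e_neg, stop_words)
    if not neg:
        return False
    return len(neg - pos) / len(neg) >= min_diff
```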
The model minimizes the pairwise ranking loss across all training instances:
$J = \sum_{(f, e, e^-)} \mathrm{loss}(f, e, e^-)$  (6)
We use the standard back-propagation algorithm to further fine-tune the neural network parameters $W$ and $\mathbf{b}$ in Equation (1). The learned networks are then used to obtain sentence topic representations, which are further leveraged to infer topic representations of bilingual translation rules.
We incorporate the learned topic similarity scores into the standard log-linear framework for SMT. When a synchronous rule $r$ is extracted from a sentence pair $(f, e)$, a triple instance $I = (r, \langle f, e \rangle, c)$ is collected for inferring the topic representation of $r$, where $c$ is the count of rule occurrences. Following [7], we give a count of one for each phrase-pair occurrence and a fractional count for each hierarchical phrase pair. The topic representation of $r$ is then calculated as the weighted average:
$\mathbf{z}_f(r) = \frac{\sum_{I \in \mathcal{T}(r)} c_I \cdot \mathbf{z}_f}{\sum_{I \in \mathcal{T}(r)} c_I}$  (7)

$\mathbf{z}_e(r) = \frac{\sum_{I \in \mathcal{T}(r)} c_I \cdot \mathbf{z}_e}{\sum_{I \in \mathcal{T}(r)} c_I}$  (8)

where $\mathcal{T}(r)$ denotes all instances collected for the rule $r$, $c_I$ is the count in instance $I$, and $\mathbf{z}_f(r)$ and $\mathbf{z}_e(r)$ are the source-side and target-side topic vectors of $r$ respectively.
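The weighted average of Equations (7)-(8) can be sketched as follows; the triple layout and variable names are illustrative, and the sentence-level vectors are assumed to be numpy arrays produced by the fine-tuned networks.

```python
import numpy as np

def rule_topic_vectors(instances):
    """instances: list of (z_f, z_e, count) triples collected for one rule r."""
    total = sum(count for _, _, count in instances)
    z_f_rule = sum(count * z_f for z_f, _, count in instances) / total
    z_e_rule = sum(count * z_e for _, z_e, count in instances) / total
    return z_f_rule, z_e_rule

# Example with two instances of the same rule (toy 3-dimensional topic vectors).
instances = [(np.array([0.2, 0.7, 0.1]), np.array([0.3, 0.6, 0.1]), 2.0),
             (np.array([0.1, 0.8, 0.1]), np.array([0.2, 0.7, 0.1]), 1.0)]
z_f_r, z_e_r = rule_topic_vectors(instances)
```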
By measuring the similarity between source texts and bilingual translation rules, the SMT decoder is able to reward topic-relevant translation candidates and penalize topic-irrelevant ones, which helps train a smarter translation model with the embedded topic information. Given a source sentence $f$ to be translated, we define the similarity features as follows:
$\mathrm{Sim}_{src}(f, r) = \mathrm{sim}(\mathbf{z}_f, \mathbf{z}_f(r))$  (9)

$\mathrm{Sim}_{trg}(f, r) = \mathrm{sim}(\mathbf{z}_f, \mathbf{z}_e(r))$  (10)

where $\mathbf{z}_f$ is the topic representation of $f$ and $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity of Equation (4). The similarity calculated against $\mathbf{z}_f(r)$ or $\mathbf{z}_e(r)$ is the source-to-source or the source-to-target similarity respectively.
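The two similarity features can be sketched in a few lines; cosine is written out with numpy so the snippet stays self-contained, and the epsilon guard is an assumption to avoid division by zero.

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity, as in Equation (4), for numpy vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def rule_similarity_features(z_sentence, z_f_rule, z_e_rule):
    """Equations (9)-(10): source-to-source and source-to-target similarities."""
    return cosine(z_sentence, z_f_rule), cosine(z_sentence, z_e_rule)
```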
We also estimate topic sensitivity, since general rules have flatter topic distributions while topic-specific rules have sharper ones. An entropy-based metric is used to measure the sensitivity of the source side of $r$:

$\mathrm{Sen}_{src}(r) = \sum_{i=1}^{L} z_f(r)_i \cdot \log z_f(r)_i$  (11)

where $z_f(r)_i$ is a component of the vector $\mathbf{z}_f(r)$. The target-side sensitivity is calculated in the same way. The larger the sensitivity is, the more topic-specific the rule is.
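A sketch of the sensitivity feature is given below. The components are normalized so they can be treated as a distribution and a small epsilon guards the logarithm; both steps are assumptions of this sketch, which scores sharper (more topic-specific) vectors higher, in line with Equation (11).

```python
import numpy as np

def sensitivity(z_rule, eps=1e-12):
    """Entropy-based sensitivity: larger values indicate sharper, more
    topic-specific vectors (negative entropy of the normalized components)."""
    p = np.asarray(z_rule, dtype=np.float64)
    p = p / max(p.sum(), eps)       # normalize components (assumption)
    p = np.clip(p, eps, None)       # avoid log(0)
    return float(np.sum(p * np.log(p)))
```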
In addition to the traditional SMT features, we add the new topic-related features to the standard log-linear framework. For the SMT system, the best translation candidate $e^{*}$ is given by:

$e^{*} = \arg\max_{e} P(e \mid f)$  (12)
where the translation probability is given by:

$P(e \mid f) \propto \exp\Big(\sum_{i} \lambda_i h_i(f, e) + \sum_{j} \gamma_j s_j(f, e)\Big)$  (13)
where $h_i(f, e)$ is a standard feature function and $\lambda_i$ is the corresponding feature weight, while $s_j(f, e)$ is a topic-related feature function and $\gamma_j$ is its feature weight. The detailed feature description is as follows (a short scoring sketch is given after the list):
Standard features: Translation model, including translation probabilities and lexical weights for both directions (4 features), 5-gram language model (1 feature), word count (1 feature), phrase count (1 feature), NULL penalty (1 feature), number of hierarchical rules used (1 feature).
Topic-related features: rule similarity scores (2 features), rule sensitivity scores (2 features).
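As a toy illustration of Equation (13), the sketch below combines standard and topic-related feature values with their tuned weights into a single model score. The feature names and weight values are placeholders, not numbers from the paper.

```python
def log_linear_score(standard_features, topic_features, weights):
    """Weighted sum of all feature values, as in the exponent of Equation (13)."""
    score = 0.0
    for name, value in {**standard_features, **topic_features}.items():
        score += weights.get(name, 0.0) * value
    return score

# Placeholder feature values for one translation candidate.
topic_features = {"sim_src": 0.82, "sim_trg": 0.74, "sen_src": -1.3, "sen_trg": -1.1}
standard_features = {"lm": -42.0, "phrase_count": 5.0, "word_count": 12.0}
weights = {"lm": 0.5, "phrase_count": 0.1, "word_count": -0.2,
           "sim_src": 0.3, "sim_trg": 0.3, "sen_src": 0.1, "sen_trg": 0.1}
score = log_linear_score(standard_features, topic_features, weights)
```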
We evaluate the performance of our neural network based topic similarity model on a Chinese-to-English machine translation task. For neural network training, a large number of monolingual documents are collected in both the source and target languages. The documents come mainly from two domains: news and weblogs. We use the Chinese and English Gigaword corpora (Version 5), which are mainly from the news domain. In addition, we collect weblog documents covering a variety of topics from the web. The overall data statistics are presented in Table 1. These documents are indexed with Lucene (http://lucene.apache.org/) in an inverted-index format, so that they can be efficiently retrieved using the parallel sentence pairs as queries. The $N$ most relevant documents are collected, and we experiment with different settings of $N$.
| Domain | Chinese Docs | Chinese Words | English Docs | English Words |
|---|---|---|---|---|
| News | 5.7M | 5.4B | 9.9M | 25.6B |
| Weblog | 2.1M | 8B | 1.2M | 2.9B |
| Total | 7.8M | 13.4B | 11.1M | 28.5B |
We implement a distributed framework to speed up the training of the neural networks. The network is learned with mini-batch asynchronous gradient descent using the adaptive learning rate procedure AdaGrad [11]. We use 32 model replicas in each iteration during training; the model parameters are averaged after each iteration and sent to each replica for the next iteration. The vocabulary size for the input layer is 100,000, and we experiment with different lengths $L$ for the hidden layer. In the pre-training phase, all parallel data is fed into the two neural networks for DAE training, with the network parameters $W$ and $\mathbf{b}$ randomly initialized. In the fine-tuning phase, for each parallel sentence pair, we randomly select ten other sentence pairs that satisfy the criterion in Section 3.2 as negative instances. These training instances are used to optimize the similarity of the two vectors.
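The distributed training scheme can be pictured with a simplified single-process simulation: each replica performs a mini-batch AdaGrad update on its own shard, and the parameters are averaged across replicas after every iteration. Function names and hyper-parameters are assumptions for illustration, not details of the actual framework.

```python
import numpy as np

def adagrad_update(params, grads, accum, lr=0.1, eps=1e-8):
    """AdaGrad: scale each coordinate's step by its accumulated squared gradients."""
    accum += grads ** 2
    params -= lr * grads / (np.sqrt(accum) + eps)
    return params, accum

def average_replicas(replica_params):
    """Average the parameters of all model replicas after one iteration."""
    return np.mean(np.stack(replica_params), axis=0)
```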
For SMT training, an in-house hierarchical phrase-based SMT decoder is implemented for our experiments. The CKY decoding algorithm is used and cube pruning is performed with the same default parameter settings as in Chiang (2007). The parallel data we use is released by LDC (LDC2003E14, LDC2002E18, LDC2003E07, LDC2005T06, LDC2005T10, LDC2005E83, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006E26, LDC2007T09). In total, the datasets contain nearly 1.1 million sentence pairs. Translation models are trained over the parallel data, which is automatically word-aligned using GIZA++ in both directions, with the grow-diag-final heuristic used to refine the symmetric word alignment. An in-house language modeling toolkit is used to train a 5-gram language model with modified Kneser-Ney smoothing [17]. The English monolingual data used for language modeling is the same as in Table 1. The NIST 2003 dataset is used as development data, and the test data consists of the NIST 2004, 2005, 2006 and 2008 datasets. The evaluation metric for overall translation quality is case-insensitive BLEU4 [25]. The reported BLEU scores are averaged over 5 runs of MERT [24]. Statistical significance is tested using the bootstrap re-sampling method [18].
The baseline is a re-implementation of the Hiero system [7]. Phrase pairs that appear only once in the parallel data are discarded because most of them are noisy. We also use the fix-discount method of Foster et al. (2006) for phrase table smoothing. This implementation makes the system perform much better while keeping the translation model much smaller.
We compare our method with the LDA-based approach proposed by Xiao et al. (2012). In [34], the topic of each sentence pair is exactly that of the document it belongs to. Since some of our parallel data does not have document-level information, we rely on the IR method to retrieve the most relevant document and simulate this approach. The PLDA toolkit [22] is used to infer topic distributions, which takes 34.5 hours to finish.
We first examine the relationship among translation accuracy (BLEU), the number of retrieved documents ($N$) and the length of the hidden layer ($L$) on the different test datasets. The results are shown in Figure 3. The best translation accuracy is achieved with $N$=10 for most settings. This confirms that enriching the source text with topic-related documents is very useful for determining topic representations and thereby helps guide synchronous rule selection. However, we find that as $N$ becomes larger, e.g. $N$=50, the translation accuracy drops drastically: as more documents are retrieved, less relevant information is also used to train the neural networks, and irrelevant documents introduce many unrelated topic words that degrade neural network learning.
Another important factor is the length $L$ of the hidden layer in the network. In deep learning, this parameter is often tuned empirically. As shown in Figure 3, the translation accuracy is better when $L$ is relatively small; in fact, there is no obvious difference in performance when $L$ is less than 600. However, when $L$ equals 1,000, the translation accuracy is inferior to the other settings. The main reason is that the neural networks then have too many parameters to be trained effectively: with a vocabulary of 100,000 and $L$=1,000, there are 100,000 × 1,000 = 10^8 parameters between the linear and non-linear layers. The limited training data prevents the model from getting close to the global optimum, so it is likely to fall into local optima and produce unreliable representations.
| Settings | NIST 2004 | NIST 2005 | NIST 2006 | NIST 2008 | Average |
|---|---|---|---|---|---|
| Baseline | 42.25 | 41.21 | 38.05 | 31.16 | 38.17 |
| [34] | 42.58 | 41.61 | 38.39 | 31.58 | 38.54 |
| Sim(Src) | 42.51 | 41.55 | 38.53 | 31.57 | 38.54 |
| Sim(Trg) | 42.43 | 41.48 | 38.40 | 31.49 | 38.45 |
| Sim(Src+Trg) | 42.70 | 41.66 | 38.66 | 31.66 | 38.67 |
| Sim(Src+Trg)+Sen(Src) | 42.77 | 41.81 | 38.85 | 31.73 | 38.79 |
| Sim(Src+Trg)+Sen(Trg) | 42.85 | 41.79 | 38.76 | 31.70 | 38.78 |
| Sim(Src+Trg)+Sen(Src+Trg) | 42.95 | 41.97 | 38.91 | 31.88 | 38.93 |
We evaluate the performance of adding the new topic-related features to the log-linear model and compare the translation accuracy with the method in [34]. To make the different methods comparable, we set the dimension of the topic representation to 100 for all settings; this takes 10 hours for the pre-training phase and 22 hours for the fine-tuning phase. Table 2 shows how the accuracy improves as more features are added. The results confirm that topic information is indispensable for SMT, since both [34] and our neural network based method significantly outperform the baseline system. Our method improves on the baseline by up to 0.86 BLEU points and by 0.76 BLEU points on average. We observe that the source-side similarity is more effective than the target-side similarity, but their contributions are cumulative, which shows that the bilingually induced topic representation helps the SMT system disambiguate translation candidates. Furthermore, the rule sensitivity features improve performance over using the similarity features alone: because topic-specific rules usually have a larger sensitivity score, they can beat general rules when both obtain the same similarity score against the input sentence. Finally, integrating all new features gives the best performance, which is substantially better than [34] by 0.39 BLEU points on average.
It is worth mentioning that the performance of [34] is similar to our settings with $N$=1 and $L$=100 in Figure 3. This is not simply a coincidence, since their approach can be interpreted as a special case of our neural network method: when a parallel sentence pair has document-level information, that document is retrieved for training; otherwise, the most relevant document is retrieved from the monolingual data. Therefore, our method can be viewed as a more general framework than previous LDA-based approaches.
In this section, we give a case study to explain why our method works. An example of translation rule disambiguation for a sentence from the NIST 2005 dataset is shown in Figure 4. The topic of this sentence is “rescue after a natural disaster”. Under this topic, the Chinese rule “发送 X” should be translated to “deliver X” or “distribute X”. However, the baseline system prefers “send X” over those two candidates. Although the translation probability of “send X” is much higher, it is inappropriate in this context, since it is usually used in IT texts, e.g. 发送邮件 (send emails), 发送信息 (send messages) and 发送数据 (send data). In contrast, with our neural network based approach, the learned topic distributions of “deliver X” and “distribute X” are more similar to that of the input sentence than “send X”, as shown in Figure 4. The similarity scores indicate that “deliver X” and “distribute X” are more appropriate translations for this sentence. Therefore, adding the topic-related features helps keep the topic consistent and substantially improves translation accuracy.
Topic modeling was first leveraged to improve SMT performance in [37, 38]. They proposed a bilingual topical admixture approach for word alignment and assumed that each word pair follows a topic-specific model. They reported extensive empirical analysis and improved word alignment accuracy as well as translation quality. Following this work, [34] extended topic-specific lexicon translation models to hierarchical phrase-based translation, where the topic information of synchronous rules is directly inferred with the help of document-level information. Their experiments show that this approach not only achieves better translation performance but also provides faster decoding compared with previous lexicon-based LDA methods.
Another line of work leverages topic modeling techniques for domain adaptation. Tam et al. (2007) used bilingual LSA to learn latent topic distributions across different languages and enforced a one-to-one topic correspondence during model training. They incorporated the bilingual topic information into language model adaptation and lexicon translation model adaptation, achieving significant improvements in large-scale evaluations. [30] investigated the relationship between out-of-domain bilingual data and in-domain monolingual data via topic mapping using HTMM methods; they estimated phrase-topic distributions in translation model adaptation and obtained better translation quality. Recently, Chen et al. (2013) proposed using a vector space model for adaptation, where genre resemblance is leveraged to improve translation accuracy. We have also investigated multi-domain adaptation where explicit topic information is used to train domain-specific models [9].
Generally, most previous research has leveraged conventional topic modeling techniques such as LDA or HTMM. In our work, a novel neural network based approach is proposed to infer topic representations for parallel data. The advantage of our method is that it is applicable to both sentence-level and document-level SMT, since we place no restrictions on the input. In addition, our method directly maximizes the similarity between parallel sentence pairs, which is ideal for SMT decoding. Compared to document-level topic modeling, which assigns the topic of a document to all sentences within it [34], our contributions are:
We proposed a more general approach to leveraging topic information for SMT by using IR methods to get a collection of related documents, regardless of whether or not document boundaries are explicitly given.
We used neural networks to learn topic representations more accurately, with more practical and scalable modeling techniques.
We directly optimized bilingual topic similarity in the deep learning framework with the help of sentence-level parallel data, so that the learned representations can be easily used in the SMT decoding procedure.
In this paper, we propose a neural network based approach to learning bilingual topic representations for SMT. We enrich the contexts of parallel sentence pairs with topic-related monolingual data, obtaining a set of documents to represent each sentence. These documents are converted to bag-of-words inputs and fed into neural networks. The learned low-dimensional vectors are used to obtain the topic representations of synchronous rules. In SMT decoding, appropriate rules are selected to best match the source text according to their similarity in the topic space. Experimental results show that our approach helps SMT systems learn a better translation model, yielding significant improvements over the state-of-the-art Hiero system as well as a conventional LDA-based method.
In future research, we will extend our neural network methods to document-level translation, where topic transitions between sentences are a crucial problem. Since the translation of the current sentence is usually influenced by the topics of previous sentences, we plan to leverage recurrent neural networks to model this phenomenon, so that history translation information is naturally incorporated in the model.
We are grateful to the anonymous reviewers for their insightful comments. We also thank Fei Huang (BBN), Nan Yang, Yajuan Duan, Hong Sun and Duyu Tang for the helpful discussions. This work is supported by the National Natural Science Foundation of China (Grant No. 61272384).