Statistical Machine Translation (SMT) usually utilizes contextual information to disambiguate translation candidates. However, the contexts it exploits are often limited to sentence boundaries, so broader topical information cannot be leveraged. In this paper, we propose a novel approach to learning topic representation for parallel data using a neural network architecture, where abundant topical contexts are embedded via topic-relevant monolingual data. By associating each translation rule with the topic representation, topic-relevant rules are selected according to the distributional similarity with the source text during SMT decoding. Experimental results show that our method significantly improves translation accuracy in the NIST Chinese-to-English translation task compared to a state-of-the-art baseline.
Lei Cui¹, Dongdong Zhang², Shujie Liu², Qiming Chen³, Mu Li², Ming Zhou², Muyun Yang¹
¹ School of Computer Science and Technology, Harbin Institute of Technology, Harbin, P.R. China
leicui@hit.edu.cn, ymy@mtlab.hit.edu.cn
² Microsoft Research, Beijing, P.R. China
{dozhang,shujliu,muli,mingzhou}@microsoft.com
³ Shanghai Jiao Tong University, Shanghai, P.R. China
simoncqm@gmail.com
Making translation decisions is a difficult task in many Statistical Machine Translation (SMT) systems. Current translation modeling approaches usually use context-dependent information to disambiguate translation candidates. For example, translation sense disambiguation approaches [4, 5] have been proposed for phrase-based SMT systems. Meanwhile, for hierarchical phrase-based or syntax-based SMT systems, there is also much work on using rich contexts to guide rule selection [15, 21, 23, 35]. Although these methods are effective and have proven successful in many SMT systems, they only leverage within-sentence contexts, which are insufficient for exploring broader information. For example, the word driver often means “the operator of a motor vehicle” in common texts. But in the sentence “Finally, we write the user response to the buffer, i.e., pass it to our driver”, we understand that driver means “computer program”. In this case, people understand the meaning because of the IT topical context, which goes beyond sentence-level analysis and requires more relevant knowledge. Therefore, it is important to leverage topic information to learn smarter translation models and achieve better translation performance.
Topic modeling is a useful mechanism for discovering and characterizing various semantic concepts embedded in a collection of documents. Attempts at topic-based translation modeling include topic-specific lexicon translation models [37, 38], topic similarity models for synchronous rules [34], and document-level translation with topic coherence [36]. In addition, topic-based approaches have been used in domain adaptation for SMT [31, 30], where different topics are viewed as different domains. One property these approaches have in common is that they only utilize parallel data where document boundaries are explicitly given. In this way, the topic of a sentence can be inferred with document-level information using off-the-shelf topic modeling toolkits such as Latent Dirichlet Allocation (LDA) [3] or the Hidden Topic Markov Model (HTMM) [14]. Most of them also assume that the input is given at the document level. However, this assumption does not always hold, since a considerable amount of parallel data lacks document boundaries. In addition, contemporary SMT systems often work at the sentence level rather than the document level for efficiency reasons. Although we can easily apply LDA at the sentence level, it is quite difficult to infer the topic accurately from only a few words in a sentence. This makes previous approaches impractical for real-world commercial SMT systems. Therefore, we need a systematic approach to enriching the sentence and inferring its topic more accurately.
In this paper, we propose a novel approach to learning topic representations for sentences. Since the information within a sentence is insufficient for topic modeling, we first enrich sentence contexts via Information Retrieval (IR) methods, using the content words in the sentence as queries so that topic-related monolingual documents can be collected. These topic-related documents are then utilized to learn a specific topic representation for each sentence with a neural network based approach. Neural networks are an effective technique for learning different levels of data representation, where high-level representations are obtained from the low-level bag-of-words input and the levels correspond to distinct levels of concepts. They can detect correlations among any subset of input features through non-linear transformations, which helps eliminate the effect of noisy words that are irrelevant to the topic. Our problem fits well into this framework, and we expect it to improve the inference of topic representations for sentences.
To incorporate topic representations as translation knowledge into SMT, our neural network based approach directly optimizes similarities between the source language and target language in a compact topic space. This underlying topic space is learned from sentence-level parallel data in order to share topic information across the source and target languages as much as possible. Additionally, our model can be discriminatively trained with a large number of training instances, without expensive sampling methods such as those used in LDA or HTMM, making it more practical and scalable. Finally, we associate the learned representations with bilingual translation rules. Topic-related rules are selected according to their distributional similarity with the source text, which helps hypothesis generation in SMT decoding. We integrate topic similarity features into the log-linear model and evaluate the performance on the NIST Chinese-to-English translation task. Experimental results demonstrate that our model significantly improves translation accuracy over a state-of-the-art baseline.
Deep learning has been an active research topic in recent years and has succeeded in many machine learning areas. The technique began attracting wide attention in the mid-2000s after researchers showed how a multi-layer feed-forward neural network can be effectively trained. The training procedure often involves two phases: a layer-wise unsupervised pre-training phase and a supervised fine-tuning phase. For pre-training, the Restricted Boltzmann Machine (RBM) [16], auto-encoding [1] and sparse coding [20] are most frequently used. Unsupervised pre-training trains the network one layer at a time and helps guide the parameters of each layer towards better regions in parameter space [2]. Followed by fine-tuning in this parameter region, deep learning is able to achieve state-of-the-art performance in various research areas, including breakthrough results on the ImageNet dataset for object recognition [19] and significant error reductions in speech recognition [10].
Deep learning has also been successfully applied to a variety of NLP tasks such as part-of-speech tagging, chunking, named entity recognition, semantic role labeling [8], parsing [28], and sentiment analysis [29]. Most of this research converts a high-dimensional, sparse binary representation into a low-dimensional, real-valued representation. This low-dimensional representation is usually learned from huge amounts of monolingual text in the pre-training phase, and then fine-tuned towards a task-specific criterion. Inspired by this previous work, we first learn sentence representations using topic-related monolingual texts in the pre-training phase, and then optimize the bilingual similarity by leveraging sentence-level parallel data in the fine-tuning phase.
In this section, we explain our neural network based topic similarity model in detail, as well as how to incorporate the topic similarity features into the SMT decoding procedure. Figure 1 sketches the high-level overview, illustrating how topic representations are learned using sentence-level parallel data. Given a parallel sentence pair $(f, e)$, the first step is to treat $f$ and $e$ as queries and use IR methods to retrieve relevant documents that enrich their contextual information. Specifically, the ranking model we use is a Vector Space Model (VSM), where the query and documents are converted into tf-idf weighted vectors. The $N$ most relevant documents are retrieved and converted to high-dimensional, bag-of-words inputs $\mathbf{f}$ and $\mathbf{e}$ for representation learning (we use $\mathbf{f}$ and $\mathbf{e}$ to denote the bag-of-words vectors converted from the retrieved documents).
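To make the retrieval step concrete, the snippet below is a minimal sketch of tf-idf based VSM retrieval. The use of scikit-learn and the function names are assumptions made only for illustration; the actual experiments index the monolingual documents with a Lucene inverted index (Section 4), so this is just an approximation of that pipeline.

```python
# Hypothetical sketch of the context-enrichment step: each sentence is used as
# a query against a monolingual document collection, and the N most relevant
# documents are returned under a tf-idf vector space model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(documents):
    """Fit a tf-idf model over the monolingual document collection."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)  # shape: (num_docs, vocab)
    return vectorizer, doc_matrix

def retrieve_top_n(sentence, vectorizer, doc_matrix, n=10):
    """Return the indices of the N documents most similar to the query sentence."""
    query_vec = vectorizer.transform([sentence])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    return np.argsort(-scores)[:n]
```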
There are two phases in our neural network training process: pre-training and fine-tuning. In the pre-training phase (Section 3.1), we build two neural networks with the same structure but different parameters to learn a low-dimensional representation for sentences in the two languages. Then, in the fine-tuning phase (Section 3.2), our model directly optimizes the similarity of the two low-dimensional representations, so that it correlates well with SMT decoding. Finally, the learned representations are used to calculate similarities, which are integrated as features into the SMT decoding procedure (Section 3.3).
In the pre-training phase, we leverage neural network structures to transform high-dimensional sparse vectors into low-dimensional dense vectors. The topic similarity is calculated on top of the learned dense vectors. This dense representation should preserve the information in the bag-of-words input while alleviating the data sparseness problem. Therefore, we use a specially designed mechanism, the auto-encoder, to solve this problem. The auto-encoder [1] is one of the basic building blocks of deep learning. Assuming that the input is a $K$-of-$V$ binary vector $\mathbf{x}$ representing the bag-of-words ($V$ is the vocabulary size), an auto-encoder consists of an encoding process that maps $\mathbf{x}$ to a low-dimensional vector $\mathbf{z}$ and a decoding process that reconstructs $\hat{\mathbf{x}}$ from $\mathbf{z}$. The objective of the auto-encoder is to minimize the reconstruction error $\mathcal{L}(\mathbf{x}, \hat{\mathbf{x}})$. Our goal is to learn a low-dimensional vector $\mathbf{z}$ that preserves the information in the original $K$-of-$V$ vector.
One problem with the auto-encoder is that it treats all words in the same way, making no distinction between function words and content words. The representation learned by auto-encoders therefore tends to be influenced by function words and is not robust. To alleviate this problem, Vincent et al. (2008) proposed the Denoising Auto-Encoder (DAE), which aims to reconstruct a clean, “repaired” input from a corrupted, partially destroyed vector. This is done by corrupting the initial input $\mathbf{x}$ to obtain a partially destroyed version $\tilde{\mathbf{x}}$. The DAE is capable of capturing the global structure of the input while ignoring the noise. In our task, for each sentence, we treat the retrieved relevant documents as a single large document and convert it to a bag-of-words vector $\mathbf{x}$, as shown in Figure 2. With the DAE, the input $\mathbf{x}$ is corrupted by applying masking noise (randomly setting 1s to 0s), yielding $\tilde{\mathbf{x}}$. Denoising training can be viewed as “filling in the blanks” [33]: the masked components can be recovered from the non-corrupted components. For example, in IT-related texts, if the word driver is masked, it should be predicted through the hidden units of the neural network from active signals such as “buffer”, “user response”, etc.
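A small sketch of how the DAE input could be prepared is given below: the retrieved documents are merged into one bag-of-words vector, which is then corrupted with masking noise. The tokenization, vocabulary handling, and corruption rate are assumptions of this sketch, not details from the paper's implementation.

```python
import numpy as np

def bag_of_words(tokens, vocab):
    """Build a K-of-V binary vector over a fixed vocabulary (V = len(vocab))."""
    x = np.zeros(len(vocab), dtype=np.float32)
    for tok in tokens:
        if tok in vocab:
            x[vocab[tok]] = 1.0
    return x

def mask_corrupt(x, rng, corruption_rate=0.3):
    """Apply masking noise: randomly set a fraction of the active entries to 0."""
    x_tilde = x.copy()
    active = np.flatnonzero(x_tilde)
    n_mask = int(corruption_rate * len(active))
    if n_mask > 0:
        masked = rng.choice(active, size=n_mask, replace=False)
        x_tilde[masked] = 0.0
    return x_tilde

# Example usage with a toy vocabulary.
vocab = {"buffer": 0, "driver": 1, "user": 2, "response": 3}
x = bag_of_words("we write the user response to the buffer".split(), vocab)
x_tilde = mask_corrupt(x, np.random.default_rng(0))
```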
In our case, the encoding process transforms the corrupted input $\tilde{\mathbf{x}}$ into $\mathbf{z}$ with two layers: a linear layer followed by a non-linear layer. Assuming that the dimension of $\mathbf{z}$ is $L$, the linear layer forms an $L \times V$ matrix $W$ which projects the $K$-of-$V$ vector to an $L$-dimensional hidden layer. After the bag-of-words input has been transformed, it is fed into a subsequent layer to model the highly non-linear relations among words:

$\mathbf{z} = g(W\tilde{\mathbf{x}} + \mathbf{b})$  (1)
where $\mathbf{z}$ is the output of the non-linear layer, $\mathbf{b}$ is an $L$-dimensional bias vector, and $g(\cdot)$ is a non-linear function, for which common choices include the sigmoid function, the hyperbolic tangent, the “hard” hyperbolic tangent, the rectifier function, etc. In this work, we use the rectifier as our non-linear function due to its efficiency and better performance [13]:

$g(x) = \max(0, x)$  (2)
The decoding process consists of a linear layer and a non-linear layer with a similar network structure but different parameters. It transforms the $L$-dimensional vector $\mathbf{z}$ back to a $V$-dimensional vector $\hat{\mathbf{x}}$. To minimize the reconstruction error with respect to the uncorrupted input $\mathbf{x}$, we define the loss function as the L2-norm of the difference between the uncorrupted input and the reconstructed input:

$\mathcal{L}(\mathbf{x}, \hat{\mathbf{x}}) = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2$  (3)
Multi-layer neural networks are trained with the standard back-propagation algorithm [26]: the gradient of the loss function is calculated and back-propagated to the previous layers to update their parameters. Training neural networks involves many factors, such as the learning rate and the length $L$ of the hidden layer. We discuss the optimization of these parameters in Section 4.
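The following is a compact, hedged sketch of the pre-training step defined by Equations (1)-(3): a V-to-L linear layer with a rectifier for encoding, a mirrored decoder, and an L2 reconstruction loss against the uncorrupted input, trained by back-propagation. PyTorch is assumed here purely for convenience of automatic differentiation; the paper's own distributed implementation is described in Section 4.

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Two-layer encoder/decoder corresponding to Equations (1)-(3)."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden_size), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden_size, vocab_size), nn.ReLU())

    def forward(self, x_tilde):
        z = self.encoder(x_tilde)   # Equation (1) with the rectifier of Equation (2)
        x_hat = self.decoder(z)     # reconstruction from the hidden representation
        return z, x_hat

def pretrain_step(model, optimizer, x, x_tilde):
    """One back-propagation step on the L2 reconstruction loss of Equation (3)."""
    optimizer.zero_grad()
    _, x_hat = model(x_tilde)
    loss = ((x - x_hat) ** 2).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```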
In the fine-tuning phase, we stack another layer on top of the two low-dimensional vectors to maximize the similarity between the source and target languages. The similarity scores are integrated into the standard log-linear model for making translation decisions. Since the vectors from the DAEs are trained independently on monolingual data, they may be inadequate for measuring bilingual topic similarity because the two topic spaces differ. Therefore, in this stage, parallel sentence pairs are used to connect the vectors from the two languages, since the two sides of a pair express the same topic. In effect, the objective of fine-tuning is to discover a latent topic space that is shared by both languages as much as possible. This shared topic space is particularly useful when the SMT decoder tries to match source texts with translation candidates in the target language.
Given a parallel sentence pair $(f, e)$, the DAEs learn representations for $f$ and $e$ respectively, denoted as $\mathbf{z}_f$ and $\mathbf{z}_e$ in Figure 1. We then take the two vectors as input and calculate their similarity, so that the whole neural network can be fine-tuned towards a supervised criterion with the help of parallel data. The similarity score of the representation pair is defined as the cosine similarity of the two vectors:

$\mathrm{sim}(\mathbf{z}_f, \mathbf{z}_e) = \frac{\mathbf{z}_f \cdot \mathbf{z}_e}{\lVert \mathbf{z}_f \rVert \, \lVert \mathbf{z}_e \rVert}$  (4)
Since a parallel sentence pair should share the same topic, our goal is to maximize the similarity score between the source sentence and the target sentence. Inspired by the contrastive estimation method [27], for each parallel sentence pair $(f, e)$ serving as a positive instance, we select the target sentence $e^-$ of another pair from the training data and treat $(f, e^-)$ as a negative instance. To make the similarity of the positive instance larger than that of the negative instance by some margin $\delta$, we utilize the following pairwise ranking loss:

$\mathrm{loss}(f, e, e^-) = \max\{0, \, \delta - d\}$  (5)

where $d = \mathrm{sim}(\mathbf{z}_f, \mathbf{z}_e) - \mathrm{sim}(\mathbf{z}_f, \mathbf{z}_{e^-})$. The rationale behind this criterion is that the smaller $d$ is, the more we should penalize the negative instance.
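A hedged sketch of this fine-tuning criterion is shown below: the cosine similarity of Equation (4) and the max-margin ranking loss of Equation (5). The margin value is a placeholder, and the sketch follows the same PyTorch convention as the pre-training example.

```python
import torch
import torch.nn.functional as F

def topic_similarity(z_f, z_e):
    """Equation (4): cosine similarity between two topic representations."""
    return F.cosine_similarity(z_f, z_e, dim=-1)

def ranking_loss(z_f, z_e, z_e_neg, delta=0.1):
    """Equation (5): penalize the pair when the margin d falls below delta."""
    d = topic_similarity(z_f, z_e) - topic_similarity(z_f, z_e_neg)
    return torch.clamp(delta - d, min=0.0).mean()
```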
To effectively train the model for this task, negative instances must be selected carefully. Since different sentences may have very similar topic distributions, we select negative instances that are dissimilar to the positive instances based on the following criteria (a small filtering sketch follows the list):
For each positive instance $(f, e)$, we select an $e^-$ that contains at least 30% different content words from $e$.
If no such $e^-$ can be found, we remove $(f, e)$ from the training instances for network learning.
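The filter below illustrates the negative-instance criterion: a candidate target sentence qualifies only if at least 30% of its content words differ from those of the positive target sentence. The tokenization and stop-word handling are assumptions of this sketch.

```python
def content_words(sentence, stop_words):
    """Content words approximated as tokens not in a stop-word list (assumption)."""
    return {w for w in sentence.split() if w not in stop_words}

def is_valid_negative(e_pos, e_neg, stop_words, min_diff=0.3):
    """True if at least 30% of e_neg's content words do not occur in e_pos."""
    pos = content_words(e_pos, stop_words)
    neg = content_words(e_neg, stop_words)
    if not neg:
        return False
    return len(neg - pos) / len(neg) >= min_diff
```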
The model minimizes the pairwise ranking loss across all training instances:
$J = \sum_{(f, e, e^-)} \mathrm{loss}(f, e, e^-)$  (6)
We use the standard back-propagation algorithm to further fine-tune the neural network parameters $W$ and $\mathbf{b}$ in Equation (1). The learned networks are then used to obtain sentence topic representations, which are further leveraged to infer topic representations of bilingual translation rules.
We incorporate the learned topic similarity scores into the standard log-linear framework for SMT. When a synchronous rule $r$ is extracted from a sentence pair $(f, e)$, a triple instance $I = (r, \langle f, e \rangle, c)$ is collected for inferring the topic representation of $r$, where $c$ is the count of rule occurrences. Following [7], we give a count of one for each phrase-pair occurrence and a fractional count for each hierarchical phrase pair. The topic representation of $r$ is then calculated as the weighted average:
$\mathbf{z}_f(r) = \frac{\sum_{I \in \mathcal{T}(r)} c_I \cdot \mathbf{z}_f}{\sum_{I \in \mathcal{T}(r)} c_I}$  (7)

$\mathbf{z}_e(r) = \frac{\sum_{I \in \mathcal{T}(r)} c_I \cdot \mathbf{z}_e}{\sum_{I \in \mathcal{T}(r)} c_I}$  (8)

where $\mathcal{T}(r)$ denotes all instances collected for the rule $r$, $c_I$ is the count in instance $I$, and $\mathbf{z}_f(r)$ and $\mathbf{z}_e(r)$ are the source-side and target-side topic vectors of $r$ respectively.
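The weighted average of Equations (7)-(8) can be sketched as follows; the triple layout and variable names are illustrative, and the sentence-level vectors are assumed to be numpy arrays produced by the fine-tuned networks.

```python
import numpy as np

def rule_topic_vectors(instances):
    """instances: list of (z_f, z_e, count) triples collected for one rule r."""
    total = sum(count for _, _, count in instances)
    z_f_rule = sum(count * z_f for z_f, _, count in instances) / total
    z_e_rule = sum(count * z_e for _, z_e, count in instances) / total
    return z_f_rule, z_e_rule

# Example with two instances of the same rule (toy 3-dimensional topic vectors).
instances = [(np.array([0.2, 0.7, 0.1]), np.array([0.3, 0.6, 0.1]), 2.0),
             (np.array([0.1, 0.8, 0.1]), np.array([0.2, 0.7, 0.1]), 1.0)]
z_f_r, z_e_r = rule_topic_vectors(instances)
```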
By measuring the similarity between source texts and bilingual translation rules, the SMT decoder is able to reward topic-relevant translation candidates and penalize topic-irrelevant ones, which helps train a smarter translation model with the embedded topic information. Given a source sentence $f$ to be translated, we define the similarity features as follows:
$\mathrm{Sim}_{src}(f, r) = \mathrm{sim}(\mathbf{z}_f, \mathbf{z}_f(r))$  (9)

$\mathrm{Sim}_{trg}(f, r) = \mathrm{sim}(\mathbf{z}_f, \mathbf{z}_e(r))$  (10)

where $\mathbf{z}_f$ is the topic representation of $f$ and $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity of Equation (4). The similarity calculated against $\mathbf{z}_f(r)$ or $\mathbf{z}_e(r)$ is the source-to-source or the source-to-target similarity respectively.
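The two similarity features can be sketched in a few lines; cosine is written out with numpy so the snippet stays self-contained, and the epsilon guard is an assumption to avoid division by zero.

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity, as in Equation (4), for numpy vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def rule_similarity_features(z_sentence, z_f_rule, z_e_rule):
    """Equations (9)-(10): source-to-source and source-to-target similarities."""
    return cosine(z_sentence, z_f_rule), cosine(z_sentence, z_e_rule)
```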
We also estimate topic sensitivity, since general rules have flatter topic distributions while topic-specific rules have sharper ones. An entropy-based metric is used to measure the sensitivity of the source side of $r$:

$\mathrm{Sen}_{src}(r) = \sum_{i=1}^{L} z_f(r)_i \cdot \log z_f(r)_i$  (11)

where $z_f(r)_i$ is a component of the vector $\mathbf{z}_f(r)$. The target-side sensitivity is calculated in the same way. The larger the sensitivity is, the more topic-specific the rule is.
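A sketch of the sensitivity feature is given below. The components are normalized so they can be treated as a distribution and a small epsilon guards the logarithm; both steps are assumptions of this sketch, which scores sharper (more topic-specific) vectors higher, in line with Equation (11).

```python
import numpy as np

def sensitivity(z_rule, eps=1e-12):
    """Entropy-based sensitivity: larger values indicate sharper, more
    topic-specific vectors (negative entropy of the normalized components)."""
    p = np.asarray(z_rule, dtype=np.float64)
    p = p / max(p.sum(), eps)       # normalize components (assumption)
    p = np.clip(p, eps, None)       # avoid log(0)
    return float(np.sum(p * np.log(p)))
```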
In addition to the traditional SMT features, we add the new topic-related features to the standard log-linear framework. For the SMT system, the best translation candidate $e^{*}$ is given by:

$e^{*} = \arg\max_{e} P(e \mid f)$  (12)
where the translation probability is given by:

$P(e \mid f) \propto \exp\Big(\sum_{i} \lambda_i h_i(f, e) + \sum_{j} \gamma_j s_j(f, e)\Big)$  (13)
where $h_i(f, e)$ is a standard feature function and $\lambda_i$ is the corresponding feature weight, while $s_j(f, e)$ is a topic-related feature function and $\gamma_j$ is its feature weight. The detailed feature description is as follows (a short scoring sketch is given after the list):
Standard features: Translation model, including translation probabilities and lexical weights for both directions (4 features), 5-gram language model (1 feature), word count (1 feature), phrase count (1 feature), NULL penalty (1 feature), number of hierarchical rules used (1 feature).
Topic-related features: rule similarity scores (2 features), rule sensitivity scores (2 features).
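As a toy illustration of Equation (13), the sketch below combines standard and topic-related feature values with their tuned weights into a single model score. The feature names and weight values are placeholders, not numbers from the paper.

```python
def log_linear_score(standard_features, topic_features, weights):
    """Weighted sum of all feature values, as in the exponent of Equation (13)."""
    score = 0.0
    for name, value in {**standard_features, **topic_features}.items():
        score += weights.get(name, 0.0) * value
    return score

# Placeholder feature values for one translation candidate.
topic_features = {"sim_src": 0.82, "sim_trg": 0.74, "sen_src": -1.3, "sen_trg": -1.1}
standard_features = {"lm": -42.0, "phrase_count": 5.0, "word_count": 12.0}
weights = {"lm": 0.5, "phrase_count": 0.1, "word_count": -0.2,
           "sim_src": 0.3, "sim_trg": 0.3, "sen_src": 0.1, "sen_trg": 0.1}
score = log_linear_score(standard_features, topic_features, weights)
```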
We evaluate the performance of our neural network based topic similarity model on a Chinese-to-English machine translation task. For neural network training, a large number of monolingual documents are collected in both the source and target languages. The documents come mainly from two domains: news and weblogs. We use the Chinese and English Gigaword corpora (Version 5), which are mainly from the news domain. In addition, we collect weblog documents covering a variety of topics from the web. The overall data statistics are presented in Table 1. These documents are indexed with Lucene (http://lucene.apache.org/) in an inverted-index format, so that they can be efficiently retrieved using the parallel sentence pairs as queries. The $N$ most relevant documents are collected, and we experiment with different settings of $N$.
| Domain | Chinese Docs | Chinese Words | English Docs | English Words |
|---|---|---|---|---|
| News | 5.7M | 5.4B | 9.9M | 25.6B |
| Weblog | 2.1M | 8B | 1.2M | 2.9B |
| Total | 7.8M | 13.4B | 11.1M | 28.5B |
We implement a distributed framework to speed up the training of the neural networks. The network is learned with mini-batch asynchronous gradient descent using the adaptive learning rate procedure AdaGrad [11]. We use 32 model replicas in each iteration during training; the model parameters are averaged after each iteration and sent to each replica for the next iteration. The vocabulary size for the input layer is 100,000, and we experiment with different lengths $L$ for the hidden layer. In the pre-training phase, all parallel data is fed into the two neural networks for DAE training, with the network parameters $W$ and $\mathbf{b}$ randomly initialized. In the fine-tuning phase, for each parallel sentence pair, we randomly select ten other sentence pairs that satisfy the criterion in Section 3.2 as negative instances. These training instances are used to optimize the similarity of the two vectors.
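The distributed training scheme can be pictured with a simplified single-process simulation: each replica performs a mini-batch AdaGrad update on its own shard, and the parameters are averaged across replicas after every iteration. Function names and hyper-parameters are assumptions for illustration, not details of the actual framework.

```python
import numpy as np

def adagrad_update(params, grads, accum, lr=0.1, eps=1e-8):
    """AdaGrad: scale each coordinate's step by its accumulated squared gradients."""
    accum += grads ** 2
    params -= lr * grads / (np.sqrt(accum) + eps)
    return params, accum

def average_replicas(replica_params):
    """Average the parameters of all model replicas after one iteration."""
    return np.mean(np.stack(replica_params), axis=0)
```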
For SMT training, an in-house hierarchical phrase-based SMT decoder is implemented for our experiments. The CKY decoding algorithm is used and cube pruning is performed with the same default parameter settings as in Chiang (2007). The parallel data we use is released by LDC (LDC2003E14, LDC2002E18, LDC2003E07, LDC2005T06, LDC2005T10, LDC2005E83, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006E26, LDC2007T09). In total, the datasets contain nearly 1.1 million sentence pairs. Translation models are trained over the parallel data, which is automatically word-aligned using GIZA++ in both directions, with the grow-diag-final heuristic used to refine the symmetric word alignment. An in-house language modeling toolkit is used to train a 5-gram language model with modified Kneser-Ney smoothing [17]. The English monolingual data used for language modeling is the same as in Table 1. The NIST 2003 dataset is used as development data, and the test data consists of the NIST 2004, 2005, 2006 and 2008 datasets. The evaluation metric for overall translation quality is case-insensitive BLEU4 [25]. The reported BLEU scores are averaged over 5 runs of MERT [24]. Statistical significance is tested using the bootstrap re-sampling method [18].
The baseline is a re-implementation of the Hiero system [7]. Phrase pairs that appear only once in the parallel data are discarded because most of them are noisy. We also use the fix-discount method of Foster et al. (2006) for phrase table smoothing. This implementation makes the system perform much better while keeping the translation model much smaller.
We compare our method with the LDA-based approach proposed by Xiao et al. (2012). In [34], the topic of each sentence pair is exactly that of the document it belongs to. Since some of our parallel data does not have document-level information, we rely on the IR method to retrieve the most relevant document and simulate this approach. The PLDA toolkit [22] is used to infer topic distributions, which takes 34.5 hours to finish.
We first examine the relationship among translation accuracy (BLEU), the number of retrieved documents ($N$) and the length of the hidden layer ($L$) on the different test datasets. The results are shown in Figure 3. The best translation accuracy is achieved with $N$=10 for most settings. This confirms that enriching the source text with topic-related documents is very useful for determining topic representations and thereby helps guide synchronous rule selection. However, we find that as $N$ becomes larger, e.g. $N$=50, the translation accuracy drops drastically: as more documents are retrieved, less relevant information is also used to train the neural networks, and irrelevant documents introduce many unrelated topic words that degrade neural network learning.
Another important factor is the length $L$ of the hidden layer in the network. In deep learning, this parameter is often tuned empirically. As shown in Figure 3, the translation accuracy is better when $L$ is relatively small; in fact, there is no obvious difference in performance when $L$ is less than 600. However, when $L$ equals 1,000, the translation accuracy is inferior to the other settings. The main reason is that the neural networks then have too many parameters to be trained effectively: with a vocabulary of 100,000 and $L$=1,000, there are 100,000 × 1,000 = 10^8 parameters between the linear and non-linear layers. The limited training data prevents the model from getting close to the global optimum, so it is likely to fall into local optima and produce unreliable representations.
| Settings | NIST 2004 | NIST 2005 | NIST 2006 | NIST 2008 | Average |
|---|---|---|---|---|---|
| Baseline | 42.25 | 41.21 | 38.05 | 31.16 | 38.17 |
| [34] | 42.58 | 41.61 | 38.39 | 31.58 | 38.54 |
| Sim(Src) | 42.51 | 41.55 | 38.53 | 31.57 | 38.54 |
| Sim(Trg) | 42.43 | 41.48 | 38.40 | 31.49 | 38.45 |
| Sim(Src+Trg) | 42.70 | 41.66 | 38.66 | 31.66 | 38.67 |
| Sim(Src+Trg)+Sen(Src) | 42.77 | 41.81 | 38.85 | 31.73 | 38.79 |
| Sim(Src+Trg)+Sen(Trg) | 42.85 | 41.79 | 38.76 | 31.70 | 38.78 |
| Sim(Src+Trg)+Sen(Src+Trg) | 42.95 | 41.97 | 38.91 | 31.88 | 38.93 |
We evaluate the performance of adding the new topic-related features to the log-linear model and compare the translation accuracy with the method in [34]. To make the different methods comparable, we set the dimension of the topic representation to 100 for all settings; this takes 10 hours for the pre-training phase and 22 hours for the fine-tuning phase. Table 2 shows how the accuracy improves as more features are added. The results confirm that topic information is indispensable for SMT, since both [34] and our neural network based method significantly outperform the baseline system. Our method improves on the baseline by up to 0.86 BLEU points and by 0.76 BLEU points on average. We observe that the source-side similarity is more effective than the target-side similarity, but their contributions are cumulative, which shows that the bilingually induced topic representation helps the SMT system disambiguate translation candidates. Furthermore, the rule sensitivity features improve performance over using the similarity features alone: because topic-specific rules usually have a larger sensitivity score, they can beat general rules when both obtain the same similarity score against the input sentence. Finally, integrating all new features gives the best performance, which is substantially better than [34] by 0.39 BLEU points on average.
It is worth mentioning that the performance of [34] is similar to our settings with $N$=1 and $L$=100 in Figure 3. This is not simply a coincidence, since their approach can be interpreted as a special case of our neural network method: when a parallel sentence pair has document-level information, that document is retrieved for training; otherwise, the most relevant document is retrieved from the monolingual data. Therefore, our method can be viewed as a more general framework than previous LDA-based approaches.
In this section, we give a case study to explain why our method works. An example of translation rule disambiguation for a sentence from the NIST 2005 dataset is shown in Figure 4. The topic of this sentence is “rescue after a natural disaster”. Under this topic, the Chinese rule “发送 X” should be translated to “deliver X” or “distribute X”. However, the baseline system prefers “send X” over those two candidates. Although the translation probability of “send X” is much higher, it is inappropriate in this context, since it is usually used in IT texts, e.g. 发送邮件 (send emails), 发送信息 (send messages) and 发送数据 (send data). In contrast, with our neural network based approach, the learned topic distributions of “deliver X” and “distribute X” are more similar to that of the input sentence than “send X”, as shown in Figure 4. The similarity scores indicate that “deliver X” and “distribute X” are more appropriate translations for this sentence. Therefore, adding the topic-related features helps keep the topic consistent and substantially improves translation accuracy.
Topic modeling was first leveraged to improve SMT performance in [37, 38]. They proposed a bilingual topical admixture approach for word alignment and assumed that each word pair follows a topic-specific model. They reported extensive empirical analysis and improved word alignment accuracy as well as translation quality. Following this work, [34] extended topic-specific lexicon translation models to hierarchical phrase-based translation, where the topic information of synchronous rules is directly inferred with the help of document-level information. Their experiments show that this approach not only achieves better translation performance but also provides faster decoding compared with previous lexicon-based LDA methods.
Another line of work leverages topic modeling techniques for domain adaptation. Tam et al. (2007) used bilingual LSA to learn latent topic distributions across different languages and enforced a one-to-one topic correspondence during model training. They incorporated the bilingual topic information into language model adaptation and lexicon translation model adaptation, achieving significant improvements in large-scale evaluations. [30] investigated the relationship between out-of-domain bilingual data and in-domain monolingual data via topic mapping using HTMM methods; they estimated phrase-topic distributions in translation model adaptation and obtained better translation quality. Recently, Chen et al. (2013) proposed using a vector space model for adaptation, where genre resemblance is leveraged to improve translation accuracy. We have also investigated multi-domain adaptation where explicit topic information is used to train domain-specific models [9].
Generally, most previous research has leveraged conventional topic modeling techniques such as LDA or HTMM. In our work, a novel neural network based approach is proposed to infer topic representations for parallel data. The advantage of our method is that it is applicable to both sentence-level and document-level SMT, since we place no restrictions on the input. In addition, our method directly maximizes the similarity between parallel sentence pairs, which is ideal for SMT decoding. Compared to document-level topic modeling, which assigns the topic of a document to all sentences within it [34], our contributions are:
We proposed a more general approach to leveraging topic information for SMT by using IR methods to get a collection of related documents, regardless of whether or not document boundaries are explicitly given.
We used neural networks to learn topic representations more accurately, with more practical and scalable modeling techniques.
We directly optimized bilingual topic similarity in the deep learning framework with the help of sentence-level parallel data, so that the learned representations can be easily used in the SMT decoding procedure.
In this paper, we propose a neural network based approach to learning bilingual topic representations for SMT. We enrich the contexts of parallel sentence pairs with topic-related monolingual data, obtaining a set of documents to represent each sentence. These documents are converted to bag-of-words inputs and fed into neural networks. The learned low-dimensional vectors are used to obtain the topic representations of synchronous rules. In SMT decoding, appropriate rules are selected to best match the source text according to their similarity in the topic space. Experimental results show that our approach helps SMT systems learn a better translation model, yielding significant improvements over the state-of-the-art Hiero system as well as a conventional LDA-based method.
In future research, we will extend our neural network methods to document-level translation, where topic transitions between sentences are a crucial problem. Since the translation of the current sentence is usually influenced by the topics of previous sentences, we plan to leverage recurrent neural networks to model this phenomenon, so that history translation information is naturally incorporated in the model.
We are grateful to the anonymous reviewers for their insightful comments. We also thank Fei Huang (BBN), Nan Yang, Yajuan Duan, Hong Sun and Duyu Tang for the helpful discussions. This work is supported by the National Natural Science Foundation of China (Grant No. 61272384).