We propose Bilingually-constrained Recursive Auto-encoders (BRAE) to learn semantic phrase embeddings (compact vector representations for phrases) that can distinguish phrases with different semantic meanings. The BRAE is trained to simultaneously minimize the semantic distance between translation equivalents and maximize the semantic distance between non-translation pairs. After training, the model has learned how to embed each phrase semantically in the two languages, and also how to transform the semantic embedding space of one language into that of the other. We evaluate our method on two end-to-end SMT tasks (phrase table pruning and decoding with phrasal semantic similarities), both of which require measuring the semantic similarity between a source phrase and its translation candidates. Extensive experiments show that the BRAE is remarkably effective in these two tasks.
Owing to their powerful capacity for feature learning and representation, deep (multi-layer) neural networks (DNN) have achieved great success in speech and image processing [13, 15, 6].
Recently, the statistical machine translation (SMT) community has shown strong interest in adapting and applying DNN to many tasks, such as word alignment [29], translation confidence estimation [19, 18, 31], phrase reordering prediction [16], translation modelling [1, 12] and language modelling [7, 26]. Most of these works attempt to improve components of SMT using word embeddings, which convert a word into a dense, low-dimensional, real-valued vector representation [2, 3, 5, 20].
However, in conventional (phrase-based) SMT, phrases are the basic translation units. Models that use word embeddings as direct inputs to a DNN cannot make full use of the complete syntactic and semantic information of the phrasal translation rules. Therefore, in order to successfully apply DNN to model the whole translation process (for example, the decoding process), learning compact vector representations for the basic phrasal translation units is an essential and fundamental task.
In this paper, we explore phrase embedding, which represents a phrase (a sequence of words) with a real-valued vector. Phrase embedding has been discussed from different perspectives in previous work. Socher et al. (2011) make phrase embeddings capture sentiment information. Socher et al. (2013) enable phrase embeddings to mainly capture syntactic knowledge. Li et al. (2013) attempt to encode reordering patterns in phrase embeddings. Kalchbrenner and Blunsom (2013) utilize a simple convolution model to generate phrase embeddings from word embeddings. Mikolov et al. (2013) consider a phrase as an indivisible $n$-gram. Obviously, these methods either focus on particular aspects of a phrase (e.g. its reordering pattern) or impose strong assumptions (e.g. bag-of-words or an indivisible $n$-gram). Therefore, the resulting phrase embeddings are not suitable for fully representing the phrasal translation units in SMT, since they fail to capture the full semantics of the phrases.
Instead, we focus on learning phrase embeddings from the perspective of semantic meaning, so that our phrase embeddings can fully represent phrases and best fit phrase-based SMT. Assuming that a phrase is a meaningful composition of its internal words, we propose Bilingually-constrained Recursive Auto-encoders (BRAE) to learn semantic phrase embeddings. The core idea behind the model is that a phrase and its correct translation should share the same semantic meaning; thus, they can supervise each other to learn their semantic phrase embeddings. Similarly, non-translation pairs should have different semantic meanings, and this information can also be used to guide the learning of semantic phrase embeddings.
In our method, the standard recursive auto-encoder (RAE) pre-trains the phrase embedding with an unsupervised algorithm by minimizing the reconstruction error [22], while the bilingually-constrained model learns to fine-tune the phrase embedding by minimizing the semantic distance between translation equivalents and maximizing the semantic distance between non-translation pairs.
We use an example to explain our model. As illustrated in Fig. 1, the Chinese phrase on the left and the English phrase on the right are translations of each other. If we learn the embedding of the Chinese phrase correctly, we can regard it as the gold representation for the English phrase and use it to guide the learning of the English phrase embedding. In the other direction, the Chinese phrase embedding can be learned in the same way. This procedure can be performed with a co-training-style algorithm so as to minimize the semantic distance between translation equivalents (for simplicity, non-translation pairs are not shown here). In this way, the resulting Chinese and English phrase embeddings capture as much of the semantics as possible. Furthermore, a transformation function between the Chinese and English semantic spaces can be learned as well.
With the learned model, we can accurately measure the semantic similarity between a source phrase and a translation candidate. Accordingly, we evaluate the BRAE model on two end-to-end SMT tasks (phrase table pruning and decoding with phrasal semantic similarities), both of which need to check whether a translation candidate and the source phrase share the same meaning. In phrase table pruning, we discard phrasal translation rules with low semantic similarity. In decoding with phrasal semantic similarities, we apply the semantic similarities of the phrase pairs as new features during decoding to guide translation candidate selection. The experiments show that up to 72% of the phrase table can be discarded without a significant decrease in translation quality, and that decoding with phrasal semantic similarities yields an improvement of up to 1.7 BLEU points over the state-of-the-art baseline.
In addition, our semantic phrase embeddings have many other potential applications. For instance, the semantic phrase embeddings can be directly fed to DNN to model the decoding process. Besides SMT, the semantic phrase embeddings can be used in other cross-lingual tasks (e.g. cross-lingual question answering) and monolingual applications such as textual entailment, question answering and paraphrase detection.
Recently, phrase embedding has drawn more and more attention. There are three main perspectives on this task in the monolingual setting.
The first method considers phrases as bags of words and employs a convolution model to transform word embeddings into phrase embeddings [4, 12]. Gao et al. (2013) also use bag-of-words representations, but learn BLEU-sensitive phrase embeddings. This kind of approach does not take word order into account and loses much information. In contrast, our bilingually-constrained recursive auto-encoders not only learn the composition mechanism for generating phrases from words, but also fine-tune the word embeddings during model training, so that we can exploit the full information of the phrases and their internal words.
The second method [20] deals with phrases whose meaning is not a simple composition of the meanings of their individual words, such as New York Times. They first identify phrases of this kind, then regard them as indivisible units and learn their embeddings from context information. However, it is hard for this kind of phrase embedding to capture the full semantics, since the context of a phrase is limited. Furthermore, this method can only account for a very small fraction of phrases, since most phrases are compositional. In contrast, our method attempts to learn a semantic vector representation for any phrase.
The third method views any phrase as a meaningful composition of its internal words. The recursive auto-encoder is typically adopted to learn the composition [22, 23, 21, 24, 16]. These works pre-train the RAE with an unsupervised algorithm and then fine-tune it according to the label of the phrase, such as the syntactic category in parsing (Socher et al., 2013a), the polarity in sentiment analysis [23, 24], or the reordering pattern in SMT [16]. This kind of semi-supervised phrase embedding in fact performs phrase clustering with respect to the phrase label. For example, in the RAE-based phrase reordering model for SMT [16], phrases with similar reordering tendencies (e.g. monotone or swap), such as prepositional phrases, are close to each other in the embedding space. Obviously, these semi-supervised phrase embedding methods do not fully address the semantic meaning of the phrases. Although we also follow composition-based phrase embedding, we are the first to focus on the semantic meanings of phrases, and we propose a bilingually-constrained model to induce the semantic information and to learn a transformation from the semantic space of one language to that of the other.
This section introduces the Bilingually-constrained Recursive Auto-encoders (BRAE), which are inspired by two observations. First, the recursive auto-encoder provides a reasonable composition mechanism for embedding each phrase, and semi-supervised phrase embedding [23, 21, 16] further indicates that phrase embeddings can be tuned with respect to a label. Second, even though we have no correct semantic phrase representation to use as a gold label, phrases sharing the same meaning provide an indirect but feasible supervision signal.
We first briefly present unsupervised phrase embedding and then describe the semi-supervised framework. After that, we introduce the BRAE in terms of its network structure, objective function and parameter inference.
In composition-based phrase embedding, the word vector representation is the basis and serves as the input to the neural network. After learning word embeddings with a DNN [2, 5, 20], each word in the vocabulary $V$ corresponds to a vector $x \in \mathbb{R}^{n}$, and all the vectors are stacked into an embedding matrix $L \in \mathbb{R}^{n \times |V|}$.
Given a phrase, which is an ordered list of $m$ words, each word $w_i$ has an index $k_i$ into the columns of the embedding matrix $L$. The index is used to retrieve the word's vector representation using a simple multiplication with a binary vector $b_{k_i}$, which is zero in all positions except for the $k_i$-th index:
$x_i = L\, b_{k_i} \in \mathbb{R}^{n}$    (1)
Note that the dimensionality $n$ is usually set empirically (e.g. $n = 50$, $100$ or $200$). Throughout this paper, a small $n$ is used for better illustration, as shown in Fig. 1.
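As a concrete illustration, the following sketch (a minimal example under our own assumptions: a toy vocabulary, a small random embedding matrix and illustrative variable names) shows the lookup of Eq. 1.

```python
# Minimal sketch of Eq. 1: retrieving a word vector from the embedding
# matrix L with a binary (one-hot) index vector. All names and values are
# illustrative.
import numpy as np

n = 4                                          # embedding dimensionality (small, for illustration)
vocab = {"the": 0, "military": 1, "force": 2}  # toy vocabulary
L = 0.01 * np.random.randn(n, len(vocab))      # embedding matrix, one column per word

def embed(word):
    """x_i = L * b_k, where b_k is zero everywhere except at the word's index."""
    b = np.zeros((len(vocab), 1))
    b[vocab[word]] = 1.0
    return L @ b                               # column vector of shape (n, 1)

x_military = embed("military")
```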
Assuming we are given a phrase $w_1 w_2 \cdots w_m$, it is first projected into a list of vectors $(x_1, x_2, \cdots, x_m)$ using Eq. 1. The RAE learns the vector representation of the phrase by recursively combining two children vectors in a bottom-up manner [23]. Fig. 2 illustrates an instance of a RAE applied to a binary tree, in which a standard auto-encoder (in box) is re-used at each node. The standard auto-encoder aims at learning an abstract representation of its input. For two children $c_1 = x_1$ and $c_2 = x_2$, the auto-encoder computes the parent vector $y_1$ as follows:
$y_1 = f\big(W^{(1)} [c_1; c_2] + b^{(1)}\big)$    (2)
where we multiply the parameter matrix $W^{(1)} \in \mathbb{R}^{n \times 2n}$ by the concatenation of the two children $[c_1; c_2] \in \mathbb{R}^{2n}$. After adding a bias term $b^{(1)}$, we apply an element-wise activation function $f$, such as $\tanh(\cdot)$, which is used in our experiments. In order to apply this auto-encoder to each pair of children, the representation of the parent $y_1$ must have the same dimensionality as the children $c_i$.
To assess how well the parent’s vector represents its children, the standard auto-encoder reconstructs the children in a reconstruction layer:
$[c_1'; c_2'] = f\big(W^{(2)} y_1 + b^{(2)}\big)$    (3)
where $c_1'$ and $c_2'$ are the reconstructed children, and $W^{(2)} \in \mathbb{R}^{2n \times n}$ and $b^{(2)}$ are the parameter matrix and bias term for reconstruction, respectively.
To obtain the optimal abstract representation of the inputs, the standard auto-encoder tries to minimize the reconstruction errors between the inputs and the reconstructed ones during training:
$E_{rec}([c_1; c_2]) = \frac{1}{2}\, \big\| [c_1; c_2] - [c_1'; c_2'] \big\|^{2}$    (4)
Given $y_1$, we can use Eq. 2 again to compute $y_2$ by setting the children to be $[y_1; x_3]$. The same auto-encoder is re-used until the vector of the whole phrase is generated.
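To make the composition step concrete, the sketch below implements one auto-encoder node (Eqs. 2-4) with tanh activation; the random initialization, the value of $n$ and the variable names are illustrative assumptions, not the trained parameters.

```python
# Minimal sketch of a single auto-encoder node: compose two children into a
# parent (Eq. 2), reconstruct them (Eq. 3) and score the reconstruction
# error (Eq. 4). Parameters are randomly initialized for illustration.
import numpy as np

n = 4
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n, 2 * n)); b1 = np.zeros((n, 1))      # composition
W2 = rng.normal(scale=0.1, size=(2 * n, n)); b2 = np.zeros((2 * n, 1))  # reconstruction

def compose(c1, c2):
    """Eq. 2: parent vector y = f(W1 [c1; c2] + b1), with f = tanh."""
    return np.tanh(W1 @ np.vstack([c1, c2]) + b1)

def reconstruction_error(c1, c2):
    """Eqs. 3-4: reconstruct the children from the parent and score the error."""
    y = compose(c1, c2)
    rec = np.tanh(W2 @ y + b2)                  # [c1'; c2']
    diff = np.vstack([c1, c2]) - rec
    return y, 0.5 * float(np.sum(diff ** 2))

parent, e_rec = reconstruction_error(rng.normal(size=(n, 1)), rng.normal(size=(n, 1)))
```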
For unsupervised phrase embedding, the only objective is to minimize the sum of reconstruction errors at each node in the optimal binary tree:
$E_{rec}(x) = \min_{y \in A(x)} \sum_{d \in y} E_{rec}\big([c_1; c_2]_d\big)$    (5)
where $x$ is the list of vectors of a phrase, $A(x)$ denotes all the possible binary trees that can be built from the inputs $x$, and $d$ ranges over the nodes of a tree $y$. A greedy algorithm (Socher et al., 2011) is used to generate the optimal binary tree. The parameters are optimized over all the phrases in the training data.
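The greedy construction can be sketched as follows. It reuses compose() and reconstruction_error() from the sketch above and, at each step, merges the adjacent pair with the smallest reconstruction error; this is a simplified reading of the Socher et al. (2011) algorithm, not a faithful reimplementation.

```python
# Minimal sketch of greedy binary-tree construction for Eq. 5: repeatedly
# merge the adjacent pair of nodes with the lowest reconstruction error
# until a single phrase vector remains. Assumes reconstruction_error()
# from the sketch above.
def greedy_rae(vectors):
    """Return the phrase vector and the summed reconstruction error."""
    nodes, total = list(vectors), 0.0
    while len(nodes) > 1:
        best = None                              # (index, error, parent vector)
        for i in range(len(nodes) - 1):
            parent, err = reconstruction_error(nodes[i], nodes[i + 1])
            if best is None or err < best[1]:
                best = (i, err, parent)
        i, err, parent = best
        nodes[i:i + 2] = [parent]                # replace the merged pair by the parent
        total += err
    return nodes[0], total
```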
The above RAE is completely unsupervised and can only induce general representations of the multi-word phrases. Several researchers extend the original RAEs to a semi-supervised setting so that the induced phrase embedding can predict a target label, such as polarity in sentiment analysis [23], syntactic category in parsing [21] and phrase reordering pattern in SMT [16].
In the semi-supervised RAE for phrase embedding, the objective function over a (phrase, label) pair includes the reconstruction error and the prediction error, as illustrated in Fig. 3.
$E(x, l; \theta) = \alpha\, E_{rec}(x; \theta) + (1 - \alpha)\, E_{pred}(x, l; \theta)$    (6)
where $l$ denotes the label and the hyper-parameter $\alpha$ is used to balance the reconstruction and prediction errors. For label prediction, the cross-entropy error is usually used to calculate $E_{pred}$. By optimizing the above objective, the phrases in the vector embedding space will be grouped according to their labels.
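A sketch of the semi-supervised objective in Eq. 6 is given below; the softmax label layer (W_label) and the value of alpha are illustrative assumptions rather than the exact setup of the cited works.

```python
# Minimal sketch of Eq. 6: combine the reconstruction error with a
# cross-entropy prediction error from a softmax layer over the phrase
# vector. W_label and alpha are illustrative.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def semi_supervised_error(phrase_vec, e_rec, label, W_label, alpha=0.2):
    """E = alpha * E_rec + (1 - alpha) * E_pred, with cross-entropy E_pred."""
    probs = softmax((W_label @ phrase_vec).ravel())
    e_pred = -np.log(probs[label] + 1e-12)
    return alpha * e_rec + (1.0 - alpha) * e_pred
```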
We know from semi-supervised phrase embedding that the learned vector representation can be adapted well to a given label. Therefore, learning semantic phrase embeddings would be straightforward if we were given gold vector representations of the phrases.
However, no gold semantic phrase embedding exists. Fortunately, we know that two phrases should share the same semantic representation if they express the same meaning. From this fact we can infer that if a model can learn the same embedding for any phrase pair sharing the same meaning, the learned embeddings must encode the semantics of the phrases, and such a model is the one we desire.
As translation equivalents share the same semantic meaning, we employ high-quality phrase translation pairs as training corpus in this work. Accordingly, we propose the Bilingually-constrained Recursive Auto-encoders (BRAE), whose basic goal is to minimize the semantic distance between the phrases and their translations.
Unlike previous methods, the BRAE model jointly learns two RAEs (Fig. 4 shows the network structure): one for the source language and one for the target language. For a phrase pair $(s, t)$, two kinds of errors are involved:
1. Reconstruction error $E_{rec}(s, t; \theta)$: how well do the learned vector representations $p_s$ and $p_t$ represent the phrases $s$ and $t$, respectively?
$E_{rec}(s, t; \theta) = E_{rec}(s; \theta) + E_{rec}(t; \theta)$    (7)
2. Semantic error $E_{sem}(s, t; \theta)$: what is the semantic distance between the learned vector representations $p_s$ and $p_t$?
Since the word embeddings for the two languages are learned separately and lie in different vector spaces, we do not enforce the phrase embeddings of the two languages to share one semantic vector space. Instead, we suppose there is a transformation between the two semantic embedding spaces. Thus, the semantic distance is bidirectional: the distance between $p_s$ and the transformation of $p_t$, and that between $p_t$ and the transformation of $p_s$. As a result, the overall semantic error becomes:
$E_{sem}(s, t; \theta) = E_{sem}(s|t; \theta) + E_{sem}(t|s; \theta)$    (8)
where $E_{sem}(s|t; \theta)$ measures the distance between $p_s$ and the transformation of $p_t$. The transformation is performed as follows: we first multiply a parameter matrix $W^{l}_{s}$ by $p_t$, and after adding a bias term $b^{l}_{s}$ we apply an element-wise activation function $f$. Finally, we calculate the Euclidean distance between $p_s$ and the transformed $p_t$:
$E_{sem}(s|t; \theta) = \frac{1}{2}\, \big\| p_s - f\big(W^{l}_{s}\, p_t + b^{l}_{s}\big) \big\|^{2}$    (9)
$E_{sem}(t|s; \theta)$ can be calculated in exactly the same way with $W^{l}_{t}$ and $b^{l}_{t}$. For the phrase pair $(s, t)$, the joint error is:
$E(s, t; \theta) = \alpha\, E_{rec}(s, t; \theta) + (1 - \alpha)\, E_{sem}(s, t; \theta)$    (10)
The hyper-parameter $\alpha$ weights the reconstruction and semantic errors. The final BRAE objective over the training set of phrase pairs becomes:
$J_{BRAE} = \frac{1}{N} \sum_{(s, t)} E(s, t; \theta) + \frac{\lambda}{2}\, \|\theta\|^{2}$    (11)
where $N$ is the number of phrase pairs in the training set and $\lambda$ is a regularization weight.
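The bidirectional semantic error and the joint error can be sketched as below; the transformation parameters (W_ts/b_ts for target-to-source, W_st/b_st for source-to-target) are randomly initialized here, and the value of alpha is a placeholder rather than the tuned one.

```python
# Minimal sketch of the bidirectional semantic error (Eqs. 8-9) and the
# joint error (Eq. 10). Transformation parameters and alpha are
# illustrative placeholders.
import numpy as np

n = 4
rng = np.random.default_rng(1)
W_ts = rng.normal(scale=0.1, size=(n, n)); b_ts = np.zeros((n, 1))  # target -> source space
W_st = rng.normal(scale=0.1, size=(n, n)); b_st = np.zeros((n, 1))  # source -> target space

def sem_error(p_ref, p_other, W, b):
    """Eq. 9: half squared distance between p_ref and the transformed p_other."""
    diff = p_ref - np.tanh(W @ p_other + b)
    return 0.5 * float(np.sum(diff ** 2))

def joint_error(p_s, p_t, e_rec_s, e_rec_t, alpha=0.2):
    """Eq. 10: alpha * E_rec(s,t) + (1 - alpha) * E_sem(s,t)."""
    e_rec = e_rec_s + e_rec_t                                                   # Eq. 7
    e_sem = sem_error(p_s, p_t, W_ts, b_ts) + sem_error(p_t, p_s, W_st, b_st)   # Eq. 8
    return alpha * e_rec + (1.0 - alpha) * e_sem
```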
Ideally, we want the learned BRAE model to ensure that the semantic error for a positive example (a source phrase $s$ and its correct translation $t$) is much smaller than that for a negative example (the source phrase $s$ and a bad translation $t'$). However, the current model cannot guarantee this, since the above semantic error only accounts for the positive examples.
We thus enhance the semantic error with both positive and negative examples, and the corresponding max-semantic-margin error becomes:
$E^{*}_{sem}(s, t; \theta) = \max\big\{0,\; E_{sem}(s, t; \theta) - E_{sem}(s, t'; \theta) + 1\big\}$    (12)
It tries to minimize the semantic distance between translation equivalents and maximize the semantic distance between non-translation pairs simultaneously. Using the above error function, we need to construct a negative example for each positive example. Suppose we are given a positive example $(s, t)$; the correct translation $t$ can be converted into a bad translation $t'$ by replacing the words in $t$ with randomly chosen target-language words. A negative example $(s, t')$ is then available.
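The negative-example construction and the max-semantic-margin error can be sketched as follows; target_vocab is an assumed word list, and the margin of 1 follows the hinge form we use in Eq. 12.

```python
# Minimal sketch of negative sampling and the max-semantic-margin error
# (Eq. 12). target_vocab is an assumed list of target-language words.
import random

def corrupt(target_phrase, target_vocab):
    """Turn a correct translation t into a bad translation t' by replacing
    its words with randomly chosen target-language words."""
    return [random.choice(target_vocab) for _ in target_phrase]

def max_margin_sem_error(e_sem_pos, e_sem_neg, margin=1.0):
    """Eq. 12: hinge loss pushing positive pairs below negative pairs."""
    return max(0.0, e_sem_pos - e_sem_neg + margin)
```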
Like semi-supervised RAE [16], the parameters in our BRAE model can also be divided into three sets:
$\theta_L$: the word embedding matrices $L$ for the two languages (Section 3.1.1);
$\theta_{rec}$: the recursive auto-encoder parameter matrices $W^{(1)}$, $W^{(2)}$ and bias terms $b^{(1)}$, $b^{(2)}$ for the two languages (Section 3.1.2);
$\theta_{sem}$: the transformation matrices $W^{l}_{s}$, $W^{l}_{t}$ and bias terms $b^{l}_{s}$, $b^{l}_{t}$ for the two directions of the semantic distance computation (Section 3.3.1).
To better understand the parameters, we rewrite Eq. 10 as:
$E(s, t; \theta) = \big[\alpha\, E_{rec}(s; \theta_s) + (1 - \alpha)\, E_{sem}(s|t; \theta_s)\big] + \big[\alpha\, E_{rec}(t; \theta_t) + (1 - \alpha)\, E_{sem}(t|s; \theta_t)\big]$    (13)
We can see that the parameters can be divided into two classes: $\theta_s$ for the source language and $\theta_t$ for the target language. The above equation also indicates that the source-side parameters $\theta_s$ can be optimized independently, as long as the semantic representation $p_t$ of the target phrase is given to compute $E_{sem}(s|t; \theta_s)$ with Eq. 9. The same holds for the target-side parameters $\theta_t$.
Assuming the target phrase representation $p_t$ is available, the optimization of the source-side parameters is similar to that of the semi-supervised RAE. We apply the Stochastic Gradient Descent (SGD) algorithm to optimize each parameter:
$\theta_s = \theta_s - \eta\, \frac{\partial J}{\partial \theta_s}$    (14)
where $\eta$ is the learning rate.
In order to run the SGD algorithm, we need to solve two problems: one concerning parameter initialization and the other concerning partial gradient calculation.
In parameter initialization, $\theta_{rec}$ and $\theta_{sem}$ for the source language are randomly set according to a normal distribution. For the word embedding matrix $L$, there are two choices. First, $L$ can be initialized randomly like the other parameters. Second, the word embedding matrix can be pre-trained with a DNN [2, 5, 20] using large-scale unlabeled monolingual data. We prefer the second choice, since this kind of word embedding has already encoded some semantics of the words. In this work, we employ the toolkit Word2Vec [20] to pre-train the word embeddings for the source and target languages. The word embeddings are then fine-tuned in our BRAE model to capture much richer semantics.
The partial gradient for one instance is computed as follows:
$\frac{\partial J}{\partial \theta_s} = \frac{\partial E(s|t; \theta_s)}{\partial \theta_s} + \lambda\, \theta_s$    (15)
where the source-side error $E(s|t; \theta_s)$, given the target phrase representation $p_t$, includes the reconstruction error and the updated (max-semantic-margin) semantic error:
$E(s|t; \theta_s) = \alpha\, E_{rec}(s; \theta_s) + (1 - \alpha)\, E^{*}_{sem}(s|t; \theta_s)$    (16)
Given the current $\theta_s$, we first construct the binary tree (as illustrated in Fig. 2) for each source-side phrase using the greedy algorithm [23]. Then, the derivatives of the parameters in the fixed binary tree are calculated via backpropagation through structures [10]. Finally, the parameters are updated using Eq. 14 and a new $\theta_s$ is obtained.
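The parameter update itself is a plain SGD step (Eq. 14); the sketch below assumes the gradients have already been obtained by backpropagation through structures, and the learning-rate value is only a placeholder, not the tuned one.

```python
# Minimal sketch of the SGD update in Eq. 14. `grads` is assumed to hold
# the partial gradients of Eq. 15; the learning rate is a placeholder.
def sgd_step(theta, grads, learning_rate=0.01):
    """theta, grads: dicts mapping parameter names to numpy arrays."""
    return {name: value - learning_rate * grads[name] for name, value in theta.items()}
```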
The target-side parameters $\theta_t$ can be optimized in the same way, as long as the source-side phrase representation $p_s$ is available. It seems a paradox that updating $\theta_s$ needs $p_t$ while updating $\theta_t$ needs $p_s$. To solve this problem, we propose a co-training-style algorithm which includes three steps:
1. Pre-training: applying unsupervised phrase embedding with the standard RAE to pre-train the source- and target-side phrase representations $p_s$ and $p_t$, respectively (Section 2.1.2);
2. Fine-tuning: with the BRAE model, using the target-side phrase representation $p_t$ to update the source-side parameters $\theta_s$ and obtain the fine-tuned source-side phrase representation $p'_s$, meanwhile using $p_s$ to update $\theta_t$ and get the fine-tuned $p'_t$, and then calculating the joint error over the training corpus;
3. Termination check: if the joint error reaches a local minimum or the iterations reach the pre-defined number (25 is used in our experiments), we terminate the training procedure; otherwise we set $p_s = p'_s$ and $p_t = p'_t$, and go to step 2. A structural sketch of this loop is given below.
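As a structural sketch (not a faithful reimplementation), the loop can be written as follows; pretrain_side, train_side and joint_error_fn are placeholder callables standing for the unsupervised RAE pre-training, the per-language SGD fine-tuning and the corpus-level joint error, respectively.

```python
# Structural sketch of the co-training style training procedure. The three
# callables are placeholders for the steps described above and are not
# implemented here.
MAX_ITER = 25   # pre-defined iteration limit used in the experiments

def train_brae(theta_s, theta_t, pretrain_side, train_side, joint_error_fn):
    p_s = pretrain_side("source")                      # step 1: unsupervised RAE pre-training
    p_t = pretrain_side("target")
    best_err = float("inf")
    for _ in range(MAX_ITER):                          # step 2: bilingual fine-tuning
        theta_s, new_p_s = train_side(theta_s, fixed=p_t, side="source")
        theta_t, new_p_t = train_side(theta_t, fixed=p_s, side="target")
        err = joint_error_fn(new_p_s, new_p_t)
        if err >= best_err:                            # step 3: termination check (local minimum)
            break
        best_err, p_s, p_t = err, new_p_s, new_p_t
    return theta_s, theta_t
```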
With the semantic phrase embeddings and the vector space transformation function, we apply the BRAE to measure the semantic similarity between a source phrase and its translation candidates in phrase-based SMT. Two tasks are involved in the experiments: phrase table pruning, which discards entries whose semantic similarity is very low, and decoding with the phrasal semantic similarities as additional new features.
The hyper-parameters of the BRAE model include the dimensionality $n$ of the word embedding in Eq. 1, the balance weight $\alpha$ in Eq. 10, the regularization weight $\lambda$ in Eq. 11, and the learning rate $\eta$ in Eq. 14.
For the dimensionality $n$, we have tried three settings in our experiments ($n = 50$, $100$ and $200$, as reported in Table 2). The learning rate $\eta$ is set empirically. We draw $\alpha$ from 0.05 to 0.5 with step 0.05, and $\lambda$ from a small set of candidate values. The overall error of the BRAE model is employed to guide the search procedure, and the best-performing values of $\alpha$ and $\lambda$ are finally chosen.
We have implemented a phrase-based translation system with a maximum entropy based reordering model using the bracketing transduction grammar [27, 28].
The SMT evaluation is conducted on Chinese-to-English translation. Accordingly, our BRAE model is trained on Chinese and English. The bilingual training data from LDC (category numbers LDC2000T50, LDC2002L27, LDC2003E07, LDC2003E14, LDC2004T07, LDC2005T06, LDC2005T10 and LDC2005T34) contains 0.96M sentence pairs and 1.1M entity pairs, with 27.7M Chinese words and 31.9M English words. A 5-gram language model is trained on the Xinhua portion of the English Gigaword corpus and the English part of the bilingual training data. NIST MT03 is used as the development data. NIST MT04-06 and MT08 (news data) are used as the test data. Case-insensitive BLEU is employed as the evaluation metric. Statistical significance testing is performed with the re-sampling approach [14].
In addition, we pre-train the word embeddings with the Word2Vec toolkit on large-scale monolingual data, including the aforementioned SMT data. The monolingual data contain 1.06B words for Chinese and 1.12B words for English. To obtain high-quality bilingual phrase pairs for training our BRAE model, we perform forced decoding on the bilingual training sentences and collect the phrase pairs used. After removing duplicates, 1.12M bilingual phrase pairs (with lengths ranging from 1 to 7) remain.
Pruning most of the phrase table without a large impact on translation quality is very important for translation, especially in environments where memory and time constraints are imposed. Many algorithms have been proposed to deal with this problem, such as significance pruning [11, 25], relevance pruning [8] and entropy-based pruning [17, 30]. These algorithms are based on corpus statistics, including co-occurrence statistics, phrase pair usage and composition information. For example, significance pruning, which has proven to be a very effective algorithm, computes a probability (the p-value) that tests whether a source phrase and a target phrase co-occur in a bilingual corpus more frequently than they would just by chance. The higher the p-value, the more likely the phrase pair is to be spurious.
Our work has the same objective, but instead of using corpus statistics, we attempt to measure the quality of a phrase pair from the view of semantic meaning. Given a phrase pair $(s, t)$, the BRAE model first obtains the semantic phrase representations $(p_s, p_t)$, and then transforms $p_s$ into the target semantic space and $p_t$ into the source semantic space. We finally obtain two similarities, one computed in each semantic space, denoted $Sim(s \rightarrow t)$ and $Sim(t \rightarrow s)$. Phrase pairs with low similarity are more likely to be noise and more prone to being pruned. In the experiments, we discard a phrase pair if its similarities in both directions are smaller than a threshold (to avoid the situation in which all the translation candidates for a source phrase are pruned, we always keep the 10 best candidates according to the semantic similarity).
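A sketch of this pruning rule is given below; sim_s2t and sim_t2s stand for the two directional BRAE similarities, ranking is done by the larger of the two, and the dictionary-based phrase-table layout is an illustrative assumption.

```python
# Minimal sketch of BRAE-based phrase table pruning: discard a candidate
# only if both directional similarities fall below the threshold, while
# always keeping the 10 best candidates per source phrase. sim_s2t and
# sim_t2s are assumed callables returning the BRAE similarities.
def prune_phrase_table(table, sim_s2t, sim_t2s, threshold=0.7, keep_best=10):
    """table: dict mapping a source phrase to a list of candidate translations."""
    pruned = {}
    for src, candidates in table.items():
        scored = sorted(((max(sim_s2t(src, tgt), sim_t2s(src, tgt)), tgt)
                         for tgt in candidates), reverse=True)
        pruned[src] = [tgt for rank, (score, tgt) in enumerate(scored)
                       if rank < keep_best or score >= threshold]
    return pruned
```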
Table 1 shows the comparison between our BRAE-based pruning method and the significance pruning algorithm. We can see a common phenomenon in both algorithms: for the first few thresholds, the phrase table becomes smaller and smaller while the translation quality decreases little, but the performance drops sharply at a certain threshold (16 for significance pruning, 0.8 for the BRAE-based method).
Specifically, the significance algorithm can safely discard 64% of the phrase table at its threshold of 12 with only 0.1 BLEU loss on the overall test set. In contrast, our BRAE-based algorithm can remove 72% of the phrase table at its threshold of 0.7 with only 0.06 BLEU loss in the overall evaluation. When the two algorithms use a similar portion of the phrase table (35% in BRAE and 36% in Significance), the BRAE-based algorithm outperforms the significance algorithm on all the test sets except MT04 (in the future, we will compare the performance by enforcing the two algorithms to use exactly the same portion of the phrase table). This indicates that our BRAE model is a good alternative for phrase table pruning. Furthermore, our model is much more intuitive, because it is directly based on semantic similarity.
Table 1: Phrase table pruning with the BRAE-based method and significance pruning at different thresholds ("PhraseTable" is the portion of the phrase table retained; all other figures are BLEU scores).

| Method | Threshold | PhraseTable | MT03 | MT04 | MT05 | MT06 | MT08 | ALL |
|---|---|---|---|---|---|---|---|---|
| Baseline | | 100% | 35.81 | 36.91 | 34.69 | 33.83 | 27.17 | 34.82 |
| BRAE | 0.4 | 52% | 35.94 | 36.96 | 35.00 | 34.71 | 27.77 | 35.16 |
| | 0.5 | 44% | 35.67 | 36.59 | 34.86 | 33.91 | 27.25 | 34.89 |
| | 0.6 | 35% | 35.86 | 36.71 | 34.93 | 34.63 | 27.34 | 35.05 |
| | 0.7 | 28% | 35.55 | 36.62 | 34.57 | 33.97 | 27.10 | 34.76 |
| | 0.8 | 20% | 35.06 | 36.01 | 34.13 | 33.04 | 26.66 | 34.04 |
| Significance | 8 | 48% | 35.86 | 36.99 | 34.74 | 34.53 | 27.59 | 35.13 |
| | 12 | 36% | 35.59 | 36.73 | 34.65 | 34.17 | 27.16 | 34.72 |
| | 16 | 25% | 35.19 | 36.24 | 34.26 | 33.32 | 26.55 | 34.09 |
| | 20 | 18% | 35.05 | 36.09 | 34.02 | 32.98 | 26.37 | 33.97 |
Table 2: Decoding with phrasal semantic similarities under different embedding dimensionalities $n$ (BLEU scores).

| Method | n | MT03 | MT04 | MT05 | MT06 | MT08 | ALL |
|---|---|---|---|---|---|---|---|
| Baseline | | 35.81 | 36.91 | 34.69 | 33.83 | 27.17 | 34.82 |
| BRAE | 50 | 36.43 | 37.64 | 35.35 | 35.53 | 28.59 | |
| | 100 | 36.45 | 37.44 | 35.58 | 35.42 | 28.57 | |
| | 200 | 36.34 | 37.35 | 35.78 | 34.87 | 27.84 | |
Besides using the semantic similarities to prune the phrase table, we also employ them as two informative features, like the phrase translation probabilities, to guide translation hypothesis selection during decoding. Typically, four translation probabilities are adopted in phrase-based SMT: the phrase translation probabilities and the lexical weights in both directions. The phrase translation probability is based on co-occurrence statistics, and the lexical weights consider the phrase as a bag of words. In contrast, our BRAE model focuses on compositional semantics from words to phrases. Therefore, the semantic similarities computed with our BRAE model are complementary to the existing four translation probabilities.
The semantic similarities in the two directions, $Sim(s \rightarrow t)$ and $Sim(t \rightarrow s)$, are integrated into our baseline phrase-based model as additional features. In order to investigate the influence of the dimensionality of the embedding space, we have tried three different settings, $n = 50$, $100$ and $200$.
As shown in Table 2, no matter what $n$ is, the BRAE model significantly improves the translation quality on the overall test data. The largest improvement is up to 1.7 BLEU points (on MT06 with $n = 50$). Interestingly, the translation performance does not improve consistently as the dimensionality grows. We speculate that $n = 50$ or $n = 100$ is already sufficient to distinguish good translation candidates from bad ones.
To give a better intuition about the power of the BRAE model at learning semantic phrase embeddings, we show some examples in Table 3. Given the BRAE model and the phrase training set, we search the set for the most semantically similar English phrases for any new input English phrase.
The input phrases contain different numbers of words. The table shows that the unsupervised RAE can at most capture the syntactic property when the phrases are short. For example, the unsupervised RAE finds do not want for the input phrase do not agree. When the phrase becomes longer, the unsupervised RAE cannot even capture the syntactic property. In contrast, our BRAE model learns the semantic meaning of each phrase, no matter whether it is short or relatively long. This indicates that the proposed BRAE model is effective at learning semantic phrase embeddings.
Table 3: The most semantically similar phrases returned by the unsupervised RAE and the BRAE for new input phrases.

| New Phrase | Unsupervised RAE | BRAE |
|---|---|---|
| military force | core force | military power |
| | main force | military strength |
| | labor force | armed forces |
| at a meeting | to a meeting | at the meeting |
| | at a rate | during the meeting |
| | a meeting , | at the conference |
| do not agree | one can accept | do not favor |
| | i can understand | will not compromise |
| | do not want | not to approve |
| each people in this nation | each country regards | every citizen in this country |
| | each country has its | all the people in the country |
| | each other , and | people all over the country |
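The retrieval behind Table 3 can be sketched as below; embed_phrase stands for the trained phrase encoder (unsupervised RAE or BRAE), and cosine similarity is our illustrative choice of similarity measure.

```python
# Minimal sketch of the nearest-neighbour search used to produce Table 3:
# rank the English phrases of the training set by similarity to a new
# input phrase. embed_phrase is an assumed callable; cosine similarity is
# an illustrative choice.
import numpy as np

def most_similar(query, phrase_set, embed_phrase, k=3):
    q = embed_phrase(query).ravel()
    def cosine(p):
        v = embed_phrase(p).ravel()
        return float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-12)
    return sorted(phrase_set, key=cosine, reverse=True)[:k]
```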
As the semantic phrase embeddings can fully represent a phrase, we can go a step further in phrase-based SMT and feed the semantic phrase embeddings to a DNN in order to model the whole translation process (e.g. derivation structure prediction). We will explore this direction in future work. Besides SMT, the semantic phrase embeddings can be used in other cross-lingual tasks, such as cross-lingual question answering, since the semantic similarity between phrases in different languages can be calculated accurately.
In addition to the cross-lingual applications, we believe the BRAE model can be applied in many monolingual NLP tasks which depend on good phrase representations or semantic similarity between phrases, such as named entity recognition, parsing, textual entailment, question answering and paraphrase detection.
In fact, phrases with the same meaning are translation equivalents across languages, but paraphrases within one language. Therefore, our model can be easily adapted to learn semantic phrase embeddings from paraphrases.
Our BRAE model still has some limitations. For example, as each node in the recursive auto-encoder shares the same weight matrix, the BRAE model would become weak at learning the semantic representations for long sentences with tens of words. Improving the model to semantically embed sentences is left for our future work.
This paper has explored the bilingually-constrained recursive auto-encoders in learning phrase embeddings, which can distinguish phrases with different semantic meanings. With the objective to minimize the semantic distance between translation equivalents and maximize the semantic distance between non-translation pairs simultaneously, the learned model can semantically embed any phrase in two languages and can transform the semantic space in one language to the other. Two end-to-end SMT tasks are involved to test the power of the proposed model at learning the semantic phrase embeddings. The experimental results show that the BRAE model is remarkably effective in phrase table pruning and decoding with phrasal semantic similarities.
We have also discussed many other potential applications and extensions of our BRAE model. In future work, we will explore four directions: 1) modelling the decoding process with DNN based on our semantic embeddings of the basic translation units; 2) learning semantic phrase embeddings with a paraphrase corpus; 3) applying the BRAE model to other monolingual and cross-lingual tasks; and 4) learning semantic sentence embeddings by automatically learning different weight matrices for different nodes in the BRAE model.
We thank Nan Yang for sharing the baseline code and the anonymous reviewers for their valuable comments. The research work has been partially funded by the Natural Science Foundation of China under Grants No. 61333018 and No. 61303181, and by the Hi-Tech Research and Development Program ("863" Program) of China under Grant No. 2012AA011102.