Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints

Xiaodong Zeng   Lidia S. Chao   Derek F. Wong   Isabel Trancoso   Liang Tian
NLP2CT Lab / Department of Computer and Information Science, University of Macau
INESC-ID / Instituto Superior Técnico, Lisboa, Portugal
nlp2ct.samuel@gmail.com, {lidiasc, derekfw}@umac.mo,
isabel.trancoso@inesc-id.pt, tianliang0123@gmail.com
Abstract

This study investigates building a better Chinese word segmentation model for statistical machine translation. It aims at leveraging word boundary information, automatically learned from bilingual character-based alignments, to induce a preferable segmentation model. We propose treating the induced word boundaries as soft constraints that bias the continued learning of a supervised CRFs model, trained on the treebank data (labeled), over the bilingual data (unlabeled). The induced word boundary information is encoded as a graph propagation constraint, and the constrained model induction is carried out with the posterior regularization algorithm. Experiments on a Chinese-to-English machine translation task show that the proposed model brings positive segmentation effects to translation quality.

1 Introduction

Word segmentation is regarded as a critical procedure for high-level Chinese language processing tasks, since Chinese script is written in continuous characters without explicit word boundaries (e.g., the spaces in English). Empirical work shows that word segmentation can be beneficial to Chinese-to-English statistical machine translation (SMT) [23, 4, 31]. In fact, most current SMT models assume that parallel bilingual sentences have been segmented into sequences of tokens that are meant to be “words” [12]. The practice in state-of-the-art MT systems is that Chinese sentences are tokenized by a monolingual supervised word segmentation model trained on hand-annotated treebank data, e.g., the Chinese Treebank (CTB) [25]. These models are conducive to MT to some extent, since they commonly have relatively good aggregate performance and segmentation consistency [4]. One outstanding problem, however, is that they may leave out segmentation features that are crucial for SMT, since the output words conform to a treebank segmentation standard designed around monolingual linguistic intuition rather than tailored to the SMT task.

In recent years, a number of works [23, 4, 12, 22] attempted to build segmentation models for SMT from bilingual unsegmented data instead of monolingual segmented data. They proposed to learn useful bilingual knowledge as gold-standard segmentation supervision for training a bilingual unsupervised model. Frequently, this bilingual knowledge refers to mappings of an individual English word to one or more consecutive Chinese characters, generated via statistical character-based alignment. Such mappings are leveraged either to constitute a Chinese word dictionary for maximum-matching segmentation [24], or to form labeled data for training a sequence labeling model [18]. Prior work showed that these models help to find segmentations tailored for SMT, since the bilingual word occurrence feature can be captured by the character-based alignment [15]. However, these models tend to miss linguistic segmentation patterns that monolingual supervised models capture, and they suffer from the negative effects of erroneous alignments on word segmentation.

This paper proposes an alternative Chinese word segmentation (CWS) model adapted to the SMT task, which seeks not only to retain the advantages of a monolingual supervised model built on hand-annotated linguistic knowledge, but also to assimilate the relevant bilingual segmentation properties. We propose leveraging the bilingual knowledge to form learning constraints that guide a supervised segmentation model toward a better solution for SMT. As in the bilingually motivated models, character-based alignment is employed to obtain mappings between successive Chinese characters and target-language words. Instead of directly merging the characters into concrete segmentations, this work extracts word boundary distributions for character-level trigrams (types) from the “chars-to-word” mappings. Furthermore, these word boundaries are encoded into a graph propagation (GP) expression, in order to widen the influence of the induced bilingual knowledge over the Chinese texts. The GP expression constrains similar types to have similar word boundary distributions. Crucially, the GP expression with the bilingual knowledge is then used as side information to regularize a CRFs (conditional random fields) model’s learning over treebank and bitext data, based on the posterior regularization (PR) framework [9]. This constrained learning amounts to a joint coupling of GP and CRFs, i.e., integrating GP into the estimation of a parametric structural model.

This paper is structured as follows: Section 2 points out the main differences between this study and related work. Section 3 presents the details of the proposed segmentation model. Section 4 reports the experimental results of the proposed model on a Chinese-to-English MT task. Conclusions are drawn in Section 5.

2 Related Work

In the literature, many approaches have been proposed to learn CWS models for SMT. They fall into two categories, monolingually motivated and bilingually motivated. The former primarily optimizes monolingual supervised models according to predefined segmentation properties that are manually summarized from empirical MT evaluations. Chang et al. [4] enhanced a CRFs segmentation model for MT tasks by tuning the word granularity and improving the segmentation consistency. Zhang et al. [29] produced a better segmentation model for SMT by concatenating various corpora regardless of their different specifications. In contrast to these approaches, this work uses automatically learned constraints instead of manually defined ones. Most importantly, the constraints provide better learning guidance since they originate from the bilingual texts. On the other hand, the bilingually motivated CWS models typically rely on character-based alignments to generate segmentation supervision. Xu et al. [24] proposed to employ “chars-to-word” alignments to generate a word dictionary for maximum-matching segmentation in an SMT task. The works in [12, 31] extended this dictionary extraction strategy. Ma and Way [12] adopted a co-occurrence frequency metric to iteratively optimize “candidate words” extracted from the alignments. Zhao et al. [31] attempted to find an optimal subset of the dictionary learned by the character-based alignment that maximizes the MT performance. Paul et al. [18] used the words learned from “chars-to-word” alignments to train a maximum entropy segmentation model. Rather than making such “hard” use of the bilingual segmentation knowledge, i.e., directly merging “chars-to-word” alignments into words as supervision, this study extracts word boundary information of characters from the alignments as soft constraints to regularize a CRFs model’s learning.

The graph propagation (GP) technique provides a natural way to represent data in a variety of target domains (Belkin et al., 2006). In this technique, the constructed graph has vertices consisting of labeled and unlabeled examples. Pairs of vertices are connected by weighted edges encoding the degree to which they are expected to have the same label (Zhu et al., 2003). Many recent works, such as Subramanya et al. [20], Das and Petrov [6], Zeng et al. [27, 26] and Zhu et al. [32], used GP to infer label information for unlabeled data and then leveraged the GP outcomes to learn a scalable semi-supervised model (e.g., CRFs). These approaches are referred to as pipelined learning with GP. This study also works with a similarity graph encoding the learned bilingual knowledge. But, unlike the prior pipelined approaches, this study performs joint learning, in which GP is used as a learning constraint that interacts with the CRFs model estimation.

One of our main objectives is to bias the CRFs model’s learning on unlabeled data under a non-linear GP constraint encoding the bilingual knowledge. This is accomplished by the posterior regularization (PR) framework [9]. PR performs regularization on posteriors, so that the learned model itself remains simple and tractable, while during learning it is driven to obey the constraints through setting appropriate parameters. The closest prior line of work is constrained learning, or learning with prior knowledge. Chang et al. [4] described constraint-driven learning (CODL), which augments model learning on unlabeled data by adding a cost for violating expectations of constraint features designed from domain knowledge. Mann and McCallum [13] and McCallum et al. [14] proposed to employ generalized expectation criteria (GE) to specify preferences about model expectations in the form of linear constraints on some feature expectations.

3 Methodology

This work aims at building a CWS model adapted to the SMT task. The model induction is shown in Algorithm 1. The input requires two types of training resources: segmented Chinese sentences from the treebank $\mathcal{D}_l^c$, and parallel unsegmented sentences of Chinese and a foreign language, $\mathcal{D}_u^c$ and $\mathcal{D}_u^f$. The first step is to conduct character-based alignment over the bitexts $\mathcal{D}_u^c$ and $\mathcal{D}_u^f$, where every Chinese character is an alignment target. Here, we are interested in n-to-1 alignment patterns, i.e., one target word is aligned to one or more source Chinese characters. The second step collects word boundary distributions for all types, i.e., character-level trigrams, according to the n-to-1 mappings (Section 3.1). The third step encodes the induced word boundary information into a k-nearest-neighbors (k-NN) similarity graph constructed over the entire set of types from $\mathcal{D}_l^c$ and $\mathcal{D}_u^c$ (Section 3.2). The final step trains a discriminative sequential labeling model, conditional random fields, on $\mathcal{D}_l^c$ and $\mathcal{D}_u^c$ under bilingual constraints expressed through graph propagation (Section 3.3). This constrained learning is carried out within the posterior regularization (PR) framework [9].

Algorithm 1: CWS model induction with bilingual constraints
Require: segmented Chinese sentences from the treebank $\mathcal{D}_l^c$; parallel sentences of Chinese and foreign language $\mathcal{D}_u^c$ and $\mathcal{D}_u^f$
Ensure: $\theta$, the CRFs model parameters
1: $\mathcal{D}^{c\rightarrow f}$ ← char_align_bitext($\mathcal{D}_u^c$, $\mathcal{D}_u^f$)
2: $r$ ← learn_word_bound($\mathcal{D}^{c\rightarrow f}$)
3: $\mathcal{G}$ ← encode_graph_constraint($\mathcal{D}_l^c$, $\mathcal{D}_u^c$, $r$)
4: $\theta$ ← pr_crf_graph($\mathcal{D}_l^c$, $\mathcal{D}_u^c$, $\mathcal{G}$)

3.1 Word Boundaries Learned from Character-based Alignments

The useful supervision toward a better segmentation solution for SMT is naturally extracted from MT training resources, i.e., bilingual parallel data. This study employs an approximate method introduced in [24, 12, 5] to learn bilingual segmentation knowledge. It relies on statistical character-based alignment: first, every Chinese character in the bitexts is separated by a white space so that individual characters are regarded as special “words”, or alignment targets; second, they are connected with English words by a statistical word aligner, e.g., GIZA++ [15]. Note that the aligner is restricted to an n-to-1 alignment pattern. The primary idea is that consecutive Chinese characters are grouped into a candidate word if they are aligned to the same foreign word. It is worth mentioning that prior works made straightforward use of the candidate words, treating them as gold segmentations, either as dictionary units or as labeled resources. This study treats the induced candidate words differently. We propose to extract word boundary distributions (over the four word boundary labels indicating a character’s position in a word, i.e., B (begin), M (middle), E (end) and S (single character)) for character-level trigrams, called types (each boundary distribution corresponds to the center character of a type; collecting boundary information for character trigrams rather than for individual characters reduces label ambiguity [1]), as shown in Figure 1, instead of the very specific words. There are two main reasons to do so. First, it is a more general representation that reduces the amplified impact of erroneous character alignments. Second, boundary distributions can play more flexible roles as constraints over labelings to bias the model learning.

The type-level word boundary extraction is formally described as follows. Given the $i$th sentence pair $\langle x_i^c, x_i^f, \mathcal{A}_i^{c\rightarrow f}\rangle$ of the aligned bilingual corpus $\mathcal{D}^{c\rightarrow f}$, where the Chinese sentence $x_i^c$ consists of $m$ characters $\{x_{i,1}^c, x_{i,2}^c, \dots, x_{i,m}^c\}$ and the foreign-language sentence $x_i^f$ consists of $n$ words $\{x_{i,1}^f, x_{i,2}^f, \dots, x_{i,n}^f\}$, the set $\mathcal{A}_i^{c\rightarrow f}$ contains alignment pairs $a_j=\langle C_j, x_{i,j}^f\rangle$, each connecting a group of Chinese characters $C_j=\{x_{i,j_1}^c, x_{i,j_2}^c, \dots, x_{i,j_k}^c\}$ to a single foreign word $x_{i,j}^f$. For an alignment $a_j=\langle C_j, x_{i,j}^f\rangle$, the character sequence $C_j$ constitutes a valid candidate word only if its characters are consecutive, i.e., $\forall d\in[1,k-1]:\ j_{d+1}-j_d=1$. Over the whole bilingual corpus, we assign each character in the candidate words a word boundary tag $T\in\{B,M,E,S\}$, and then count across the entire corpus to collect the tag distributions $r_i=\{r_{i,t};\ t\in T\}$ for each type $x_{i,j-1}^c x_{i,j}^c x_{i,j+1}^c$.

Figure 1: An example of a similarity graph over character-level trigrams (types).
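To make the extraction step concrete, the following is a minimal Python sketch of the procedure described above, under assumed data structures (a sentence is a sequence of Chinese characters, and an alignment is a list of (character_indices, foreign_word) pairs produced by the n-to-1 character-based aligner); the function names are illustrative, not taken from the authors' implementation.

```python
from collections import Counter, defaultdict

def candidate_words(alignment):
    """Yield index spans of candidate words: runs of consecutive Chinese
    characters that are aligned to the same foreign word (n-to-1 pattern)."""
    for char_indices, _foreign_word in alignment:
        idx = sorted(char_indices)
        # keep only spans whose character indices are strictly consecutive
        if all(b - a == 1 for a, b in zip(idx, idx[1:])):
            yield idx

def boundary_tag(position, length):
    """Map a character position inside a candidate word to B/M/E/S."""
    if length == 1:
        return 'S'
    if position == 0:
        return 'B'
    return 'E' if position == length - 1 else 'M'

def type_boundary_distributions(bitext_alignments):
    """Collect word boundary tag distributions r for character trigrams (types).
    `bitext_alignments` iterates over (chinese_chars, alignment) pairs."""
    counts = defaultdict(Counter)
    for chars, alignment in bitext_alignments:
        padded = ['<s>'] + list(chars) + ['</s>']
        for span in candidate_words(alignment):
            for pos, i in enumerate(span):
                trigram = tuple(padded[i:i + 3])   # trigram centered on character i
                counts[trigram][boundary_tag(pos, len(span))] += 1
    # normalize the counts into distributions over {B, M, E, S}
    return {t: {tag: n / sum(c.values()) for tag, n in c.items()}
            for t, c in counts.items()}
```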

3.2 Constraints Encoded by Graph Propagation Expression

The previous step generates bilingual segmentation supervision, i.e., type-level word boundary distributions. An intuitive approach would be to directly use the induced boundary distributions as label constraints to regularize the segmentation model learning within a constrained learning algorithm. This study, however, goes further and amplifies the positive effects of the bilingual knowledge via the graph propagation technique. We adopt a similarity graph to encode the learned type-level word boundary distributions. The GP expression is defined as a PR constraint in Section 3.3 that reflects the interactions between the graph and the CRFs model. In other words, GP is integrated into the estimation of the parametric structural model. This is considerably different from the prior pipelined approaches [20, 6, 27], where GP is run first and its propagated outcomes are then used to bias the structural model. This work seeks to capture the GP benefits during the modeling of sequential correlations.

In what follows, the graph setting and propagation expression are introduced. As in conventional GP settings [7], a similarity graph $\mathcal{G}=(V,E)$ is constructed over the $N$ types extracted from the Chinese training data, including the treebank $\mathcal{D}_l^c$ and the bitexts $\mathcal{D}_u^c$. Each vertex $V_i$ has a $|T|$-dimensional estimated measure $v_i=\{v_{i,t};\ t\in T\}$ representing a probability distribution over word boundary tags. The induced type-level word boundary distributions $r_i=\{r_{i,t};\ t\in T\}$ serve as empirical measures for the corresponding $M$ graph vertices. The edges $E\subseteq V\times V$ connect the vertices. The score $w_{ij}$ between a pair of graph vertices (types) reflects the similarity of their syntactic environments, computed following the method in [20, 6, 27]. The similarities are measured from co-occurrence statistics over a set of predefined features (introduced in Section 4.1). Specifically, the point-wise mutual information (PMI) values between vertices and each feature instantiation they have in common are collected into sparse vectors, and the cosine distances between these vectors are computed as the similarities. The nature of this similarity graph enforces that connected types with high edge weights, even when they appear in different texts, should have similar word boundary distributions.
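As an illustration of this graph construction, here is a rough Python sketch that turns per-type feature co-occurrence counts into PMI-weighted sparse vectors and links each type to its k most cosine-similar neighbors. The brute-force neighbor search and the positive-PMI filter are our simplifications for readability; they are not claimed to match the authors' implementation, which must scale to millions of types.

```python
import math
from collections import Counter, defaultdict

def pmi_vectors(type_features):
    """Build sparse PMI vectors from co-occurrence counts.
    `type_features` maps each type to a Counter of feature instantiations."""
    type_totals = {t: sum(c.values()) for t, c in type_features.items()}
    feat_totals = Counter()
    for c in type_features.values():
        feat_totals.update(c)
    grand_total = sum(feat_totals.values())
    vectors = {}
    for t, c in type_features.items():
        vec = {}
        for f, n in c.items():
            pmi = math.log((n / grand_total) /
                           ((type_totals[t] / grand_total) * (feat_totals[f] / grand_total)))
            if pmi > 0:                  # keep positive PMI values only (our simplification)
                vec[f] = pmi
        vectors[t] = vec
    return vectors

def cosine(u, v):
    dot = sum(u[f] * v[f] for f in set(u) & set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_graph(vectors, k=10):
    """Return edge weights w[i][j] of a k-NN similarity graph over types."""
    edges = defaultdict(dict)
    types = list(vectors)
    for t in types:
        sims = sorted(((cosine(vectors[t], vectors[u]), u) for u in types if u != t),
                      reverse=True)
        for s, u in sims[:k]:
            if s > 0:
                edges[t][u] = s
    return edges
```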

The quality (smoothness) of the similarity graph can be estimated by using a standard propagation function, as shown in Equation 1. The square-loss criterion [33, 3] is used to formulate this function:

$$\mathcal{P}(v)=\sum_{t=1}^{T}\Big(\sum_{i=1}^{M}(v_{i,t}-r_{i,t})^{2}+\mu\sum_{j=1}^{N}\sum_{i=1}^{N}w_{ij}(v_{i,t}-v_{j,t})^{2}+\rho\sum_{i=1}^{N}(v_{i,t})^{2}\Big)\qquad(1)$$

The first term in this equation is the seed match, which computes the distance between the estimated measures $v_i$ and the empirical probabilities $r_i$. The second term is the edge smoothness, which measures how well the vertices $v_i$ are smoothed with respect to the graph: two types connected by an edge with high weight should be assigned similar word boundary distributions. The third term, an $\ell_2$ norm, evaluates the distribution sparsity [7] per vertex. The GP process amounts to an optimization over the parameters $v$ such that Equation 1 is minimized. The propagation function can thus be used to reflect graph smoothness: the higher the score, the lower the smoothness.
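To make Equation 1 concrete, the following sketch evaluates its three terms (seed match, edge smoothness, and the ℓ2 sparsity penalty) for a given assignment of vertex distributions; the dictionary-based data layout is assumed purely for illustration.

```python
TAGS = ('B', 'M', 'E', 'S')

def propagation_score(v, r, edges, mu=0.5, rho=0.5):
    """Compute P(v) as in Equation 1.
    v:     estimated tag distribution per vertex, v[i][t]
    r:     empirical (seed) distributions for the M seeded vertices, r[i][t]
    edges: k-NN edge weights, edges[i][j] = w_ij
    """
    seed = sum((v[i][t] - r[i][t]) ** 2 for i in r for t in TAGS)
    smooth = sum(w * (v[i][t] - v[j][t]) ** 2
                 for i in edges for j, w in edges[i].items() for t in TAGS)
    sparsity = sum(v[i][t] ** 2 for i in v for t in TAGS)
    return seed + mu * smooth + rho * sparsity
```

Lower scores correspond to smoother assignments, so propagation amounts to minimizing this quantity over v.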

3.3 PR Learning with GP Constraint

Our learning problem belongs to semi-supervised learning (SSL), as training is done on treebank labeled data $(X_L,Y_L)=\{(x_1,y_1),\dots,(x_l,y_l)\}$ and bilingual unlabeled data $X_U=\{x_1,\dots,x_u\}$, where $x_i=\{x_1,\dots,x_m\}$ is an input character sequence and $y_i=\{y_1,\dots,y_m\}$, with $y_j\in T$, is its corresponding label sequence. Supervised linear-chain CRFs can be modeled with a standard conditional log-likelihood objective with a Gaussian prior:

$$\mathcal{L}(\theta)=\sum_{i=1}^{l}\log p_\theta(y_i|x_i)-\frac{\|\theta\|^{2}}{2\sigma}\qquad(2)$$

The conditional probability $p_\theta$ is expressed in log-linear form:

$$p_\theta(y_i|x_i)=\frac{\exp\Big(\sum_{k=1}^{m}\theta^{\top} f(y_i^{k-1},y_i^{k},x_i)\Big)}{Z_\theta(x_i)}\qquad(3)$$

where $Z_\theta(x_i)$ is a partition function that normalizes the exponential form into a probability distribution, and $f(y_i^{k-1},y_i^{k},x_i)$ are arbitrary feature functions.
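For readers less familiar with linear-chain CRFs, the sketch below shows how the feature functions in Equation 3 combine into the unnormalized score of one labeling; the feature-function signature is an assumption made purely for illustration.

```python
def crf_score(theta, feature_fn, x, y):
    """Unnormalized log-linear score of labeling y for input x (numerator of Eq. 3).
    feature_fn(prev_tag, tag, x, k) is assumed to return a dict of active
    feature names and values at position k."""
    total = 0.0
    prev = '<START>'
    for k, tag in enumerate(y):
        feats = feature_fn(prev, tag, x, k)
        total += sum(theta.get(name, 0.0) * value for name, value in feats.items())
        prev = tag
    return total   # p(y|x) is proportional to exp(total) / Z_theta(x)
```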

In our setting, the CRFs model is required to learn from unlabeled data. This work employs the posterior regularization (PR) framework [9] (readers are referred to the original paper of Ganchev et al. [9]) to bias the CRFs model’s learning on unlabeled data under a constraint encoded by the graph propagation expression. The expectation is that similar types in the graph should receive similar expected taggings under the CRFs model. We follow the approach introduced in [10] to set up a penalty-based PR objective with GP: the CRFs likelihood is modified by adding a regularization term, shown in Equation 4, that represents the constraints:

$$U(\theta,q)=\mathrm{KL}(q\,\|\,p_\theta)+\lambda\,\mathcal{P}(v)\qquad(4)$$

Rather than regularizing the CRFs model’s posteriors $p_\theta(\mathcal{Y}|x_i)$ directly, our model uses an auxiliary distribution $q(\mathcal{Y}|x_i)$ over the possible labelings $\mathcal{Y}$ of $x_i$, and penalizes the CRFs marginal log-likelihood by a KL-divergence term, $\mathrm{KL}(q\,\|\,p)=\sum_{y\in\mathcal{Y}}q(y)\log\frac{q(y)}{p(y)}$, representing the distance between the estimated posteriors $p$ and the desired posteriors $q$, as well as by a penalty term formed from the GP function. The hyperparameter $\lambda$ controls the impact of the penalty term. Note that the penalty fires if the graph score, computed from the expected taggings given by the current CRFs model, increases with respect to the previous training iteration. This requires that the penalty term $\mathcal{P}(v)$ be expressed as a function of the posteriors $q$ over the CRFs model predictions, i.e., $\mathcal{P}(q)$ (the original PR setting also requires the penalty term to be a linear (Ganchev et al., 2010) or non-linear [10] function of $q$). To state this, a mapping $\mathbb{M}:(\{1,\dots,u\},\{1,\dots,m\})\rightarrow V$ from characters in the corpus to vertices in the graph is defined. We can thus decompose $v_{i,t}$ into a function of $q$ as follows:

$$v_{i,t}=\frac{\sum_{a=1}^{u}\sum_{b=1;\,\mathbb{M}(a,b)=V_i}^{m}\sum_{c=1}^{T}\sum_{y\in\mathcal{Y}}\mathbf{1}(y_b=t,\,y_{b-1}=c)\,q(y|x_a)}{\sum_{a=1}^{u}\sum_{b=1}^{m}\mathbf{1}(\mathbb{M}(a,b)=V_i)}\qquad(5)$$
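In practice, the numerator of Equation 5 only requires the token-level marginal posteriors under q, so the decomposition reduces to averaging those marginals over all corpus positions mapped to a vertex. A minimal sketch, with assumed input layouts:

```python
from collections import defaultdict

TAGS = ('B', 'M', 'E', 'S')

def vertex_measures(marginals, vertex_of):
    """Aggregate posteriors into per-vertex boundary distributions (Equation 5).
    marginals[a][b][t]: posterior probability that character b of unlabeled
                        sentence a takes tag t under the auxiliary distribution q
    vertex_of[(a, b)]:  the mapping M from corpus positions to graph vertices"""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for (a, b), vertex in vertex_of.items():
        for t in TAGS:
            totals[vertex][t] += marginals[a][b][t]
        counts[vertex] += 1
    return {vertex: {t: totals[vertex][t] / counts[vertex] for t in TAGS}
            for vertex in totals}
```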

The final learning objective combines the CRFs likelihood with the PR regularization term: $\mathcal{J}(\theta,q)=\mathcal{L}(\theta)+U(\theta,q)$. This joint objective, over $\theta$ and $q$, can be optimized by an expectation–maximization (EM) style algorithm as reported in [9]. We start from initial parameters $\theta_0$, estimated by supervised CRFs training on the treebank data. The E-step minimizes $U(\theta,q)$ over the posteriors $q$, which are constrained to the probability simplex. Since the penalty term $\mathcal{P}(v)$ is non-linear, the optimization method in [9], projected gradient descent on the dual, is inefficient (according to [10], the dual of the quadratic program implies an expensive matrix inverse). This study instead follows the optimization method of [10], which uses the exponentiated gradient descent (EGD) algorithm, so that the variable update, shown in Equation 6, takes a multiplicative rather than an additive form.

$$q^{(w+1)}(y|x_i)=q^{(w)}(y|x_i)\,\exp\!\Big(-\eta\,\nabla_{q^{(w)}(y|x_i)}U(\theta,q)\Big)\qquad(6)$$

where the parameter $\eta$ controls the optimization rate in the E-step. With the contributions from the E-step, which further encourage $q$ and $p_\theta$ to agree, the M-step optimizes the objective $\mathcal{J}(\theta,q)$ with respect to $\theta$. The M-step is similar to standard CRFs parameter estimation, and the gradient ascent approach still applies. This EM-style procedure monotonically increases $\mathcal{J}(\theta,q)$ and is thus guaranteed to converge to a local optimum.

$$\text{E-step:}\quad q^{(t+1)}=\operatorname*{arg\,min}_{q}\ U(\theta^{(t)},q^{(t)})$$
$$\text{M-step:}\quad \theta^{(t+1)}=\operatorname*{arg\,max}_{\theta}\ \mathcal{L}(\theta)+\delta\sum_{i=1}^{u}\sum_{y\in\mathcal{Y}}q^{(t+1)}(y|x_i)\log p_\theta(y|x_i)\qquad(7)$$
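The alternation of Equation 7 can be organized as the skeleton below: the E-step applies the multiplicative EGD update of Equation 6 to the auxiliary distribution q, and the M-step retrains the CRFs parameters using q-weighted expectations on the unlabeled data. The `crf` object and `penalty_grad` function are placeholders standing in for a real CRF toolkit and the gradient of the GP penalty; they are assumptions for illustration, not the authors' code.

```python
import math

def egd_step(q, grad, eta=0.6):
    """One exponentiated gradient update (Equation 6) on a per-position tag
    distribution, followed by renormalization onto the probability simplex.
    q, grad: dicts mapping tags to probabilities / partial derivatives of U."""
    updated = {t: q[t] * math.exp(-eta * grad[t]) for t in q}
    z = sum(updated.values())
    return {t: p / z for t, p in updated.items()}

def pr_train(crf, labeled, unlabeled, penalty_grad, em_iters=10, egd_steps=20):
    """EM-style alternation of Equation 7 (high-level skeleton).
    `crf` is assumed to expose fit(), posteriors() and fit_with_soft_labels();
    `penalty_grad(q)` is assumed to return dU/dq for every unlabeled position."""
    crf.fit(labeled)                                  # theta_0: supervised warm start
    for _ in range(em_iters):
        # E-step: initialize q from the current CRFs posteriors, then run EGD
        q = {x: crf.posteriors(x) for x in unlabeled}
        for _ in range(egd_steps):
            grads = penalty_grad(q)
            q = {x: [egd_step(qb, gb) for qb, gb in zip(q[x], grads[x])]
                 for x in unlabeled}
        # M-step: retrain theta on treebank data plus q-weighted expectations
        crf.fit_with_soft_labels(labeled, unlabeled, q)
    return crf
```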

4 Experiments

4.1 Data and Setup

The experiments in this study evaluated the performance of various CWS models in a Chinese-to-English translation task. The influence of word segmentation on the final translation quality is our main investigation. We adopted three state-of-the-art metrics, BLEU [17], NIST [8] and METEOR [2], to evaluate the translation quality.

The monolingual segmented data, trainTB, is extracted from the Penn Chinese Treebank (CTB-7) [25] and contains 51,447 sentences. The bilingual training data, trainMT, is formed by a large in-house Chinese-English parallel corpus [21]. It contains in total 2,244,319 Chinese-English sentence pairs crawled from online resources, concentrated in 5 different domains: laws, novels, spoken, news and miscellaneous (the in-house corpus has been manually validated, in a long process that exceeded 500 hours). This in-house bilingual corpus also serves as the MT training data. The target-side language model is built on over 35 million monolingual English sentences, trainLM, crawled from online resources. The NIST evaluation campaign data MT-03 and MT-05 are selected as the MT development data, devMT, and testing data, testMT, respectively.

For the settings of our model, we adopted the standard feature templates introduced by Zhao et al. [30] for the CRFs. The character-based alignment for obtaining the “chars-to-word” mappings is accomplished with the GIZA++ aligner [15]. For the GP, a 10-NN similarity graph was constructed (we evaluated graphs with the top k nearest neighbors, for k from 3 to 20, on the development data, and found that the performance converged beyond 10-NN). Following [20, 27], the features used to compute similarities between vertices are, for a type “w2w3w4” with surrounding context “w1w2w3w4w5”: unigram (w3), bigram (w1w2, w4w5, w2w4), trigram (w2w3w4, w2w4w5, w1w2w4), trigram+context (w1w2w3w4w5), and character classes among number, punctuation, alphabetic letter and other (t(w2)t(w3)t(w4)). There are four hyperparameters in our model, tuned on the development data (devMT) over the following settings: for the graph propagation, μ ∈ {0.2, 0.5, 0.8} and ρ ∈ {0.1, 0.3, 0.5, 0.8}; for the PR learning, λ and σ over [0, 1] with a step of 0.1. The best-performing joint setting, μ=0.5, ρ=0.5, λ=0.9 and σ=0.8, was used to measure the final performance.
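For concreteness, the feature instantiations listed above for a type “w2w3w4” observed in the five-character window “w1w2w3w4w5” could be generated as follows; the character-class function is a simplified stand-in for the classes named in the text.

```python
def char_class(c):
    """Coarse character class: number, alphabetic letter, punctuation, or other."""
    if c.isdigit():
        return 'NUM'
    if c.isascii() and c.isalpha():
        return 'LET'
    if not c.isalnum():
        return 'PUN'
    return 'OTH'

def similarity_features(w1, w2, w3, w4, w5):
    """Feature instantiations used to compute vertex similarities for the
    type w2w3w4 in the context w1..w5 (Section 4.1)."""
    return {
        'unigram':       w3,
        'bigram_left':   w1 + w2,
        'bigram_right':  w4 + w5,
        'bigram_skip':   w2 + w4,
        'trigram':       w2 + w3 + w4,
        'trigram_right': w2 + w4 + w5,
        'trigram_left':  w1 + w2 + w4,
        'trigram_ctx':   w1 + w2 + w3 + w4 + w5,
        'char_classes':  char_class(w2) + char_class(w3) + char_class(w4),
    }
```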

The MT experiment was conducted with a standard log-linear phrase-based SMT model. The GIZA++ aligner was also adopted to obtain word alignments [15] over the segmented bitexts. The grow-diag-final-and heuristic [11] was used to combine the bidirectional alignments for extracting phrase translations and reordering tables. A 5-gram language model with Kneser-Ney smoothing was trained with SRILM [19] on the monolingual English data. Moses [11] was used as the decoder. Minimum Error Rate Training (MERT) [16] was used to tune the feature weights on the development data.

4.2 Various Segmentation Models

To provide a thorough analysis, the MT experiments in this study evaluated three baseline segmentation models and two off-the-shelf models, in addition to four variant models that also employ the bilingual constraints. We start from three baseline models:

  • Character Segmenter (CS): this model simply divides Chinese sentences into sequences of characters.

  • Supervised Monolingual Segmenter (SMS): this model is trained with CRFs on the treebank training data (trainTB). The same feature templates [30] are used, with the standard four tags (B, M, E and S) as labels. Stochastic gradient descent is adopted to optimize the parameters.

  • Unsupervised Bilingual Segmenter (UBS): this model is trained on the bitexts (trainMT) following the approach introduced in [12]. The optimal set of the model parameter values was found on devMT to be k=3,tAC=0.0 and tCOOC=15.

The comparison candidates also involve two popular off-the-shelf segmentation models:

  • Stanford Segmenter: this model, trained by Chang et al. [4], treats CWS as a binary word boundary decision task. It covers several features specific to the MT task, e.g., external lexicons and proper noun features.

  • ICTCLAS Segmenter: this model, trained by Zhang et al. [28], is a hierarchical HMM segmenter that incorporates parts-of-speech (POS) information into the probability models and generates multiple HMM models for solving segmentation ambiguities.

This work also evaluated four variant models that incorporate the bilingual constraints in alternative ways, based on two state-of-the-art graph-based SSL approaches (two of these variants work with GP; to be fair, the same similarity graph settings introduced in this paper were used for them).

  • Self-training Segmenters (STS): two variant models were defined following the approach reported in [20], which uses the supervised CRFs model’s decodings on unlabeled examples, incorporating empirical and constraint information, as additional labeled data to retrain a CRFs model. One variant (STS-NO-GP) skips the GP step and decodes directly with the type-level word boundary probabilities induced from the bitexts, while the other (STS-GP-PL) runs GP first and then decodes with the GP outcomes. The optimal hyperparameter values were found to be: STS-NO-GP (α=0.8 and η=0.6) and STS-GP-PL (μ=0.5, ρ=0.3, α=0.8 and η=0.6).

  • Virtual Evidence Segmenters (VES): two variant models based on the approach in [27] were defined. The type-level word boundary distributions, induced by the character-based alignment (VES-NO-GP) or by graph propagation (VES-GP-PL), are treated as virtual evidence to bias the CRFs model’s learning on the unlabeled data. The optimal hyperparameter values were found to be: VES-NO-GP (α=0.7) and VES-GP-PL (μ=0.5, ρ=0.3 and α=0.7).

4.3 Main Results

Table 1 summarizes the final MT performance on the MT-05 test data, evaluated with ten different CWS models. We draw four major observations from the results. First, as expected, word segmentation does help Chinese-to-English MT: all nine other CWS models outperform the CS baseline, which does not attempt to identify Chinese words at all. Second, the other two baselines, SMS and UBS, are on a par with each other, with an average difference of less than 0.36 across the three evaluation metrics. This validates that models trained on either the treebank or the bilingual data perform reasonably well, but each captures only part of the useful segmentation features, so they bring smaller gains for SMT than the more sophisticated models. Third, the two off-the-shelf models, Stanford and ICTCLAS, bring only minor improvements over the SMS baseline, although they are trained with richer supervision. This illustrates that conventional optimizations of the monolingual supervised model, e.g., accumulating more supervised data or predefined segmentation properties, are insufficient to achieve better segmentations for SMT. Finally, turning to the five models that use the bilingual constraints, most of them achieve significant gains over the models without bilingual constraints. This strongly demonstrates that bilingually learned segmentation knowledge does help CWS for SMT. The models working with GP, i.e., STS-GP-PL, VES-GP-PL and ours, outperform all others. We attribute this to the role of GP in spreading the bilingual knowledge over the Chinese side. Importantly, our model outperforms STS-GP-PL and VES-GP-PL, which strongly supports that the joint learning of CRFs and GP can alleviate the error transfer of the pipelined models. This is one of the most crucial findings of this study. Overall, the boldface numbers in the last row show that our model obtains average improvements of 1.89, 1.76 and 1.61 on BLEU, NIST and METEOR over the other models.

Models BLEU NIST METEOR
CS 29.38 59.85 54.07
SMS 30.05 61.33 55.95
UBS 30.15 61.56 55.39
Stanford 30.40 61.94 56.01
ICTCLAS 30.29 61.26 55.72
STS-NO-GP 31.47 62.35 56.12
STS-GP-PL 31.94 63.20 57.09
VES-NO-GP 31.98 62.63 56.59
VES-GP-PL 32.04 63.49 57.34
Our Model 32.75 63.72 57.64
Table 1: Translation performances (%) on MT-05 testing data by using ten different CWS models.

4.4 Analysis & Discussion

This section further analyzes the three primary observations from Section 4.3: i) word segmentation is useful to SMT; ii) the treebank and the bilingual segmentation knowledge are both helpful, yielding segmentations of different natures; and iii) the bilingual constraints lead to segmentations better tailored for SMT.

The first observation derives from the comparisons between the CS baseline and the other models. Our results, showing significant CWS benefits to SMT, are consistent with those reported in the literature [24, 4]. In our experiment, two additional pieces of evidence found in the translation model further support that no tokenization of Chinese (i.e., the CS model’s output) can harm the MT system. First, SMT phrase extraction, i.e., building “phrases” on top of character sequences, cannot fully capture all meaningful segmentations from the CS model’s output: the character-based model misses some useful longer phrases and generates many meaningless or redundant entries in the phrase table. Moreover, it suffers from translation ambiguities in cases where a Chinese character has very different meanings in different contexts.

The second observation shifts the emphasis to SMS and UBS, based on the treebank and the bilingual segmentation, respectively. Our results show that both segmentation patterns bring positive effects to MT. By analyzing both models’ segmentations of trainMT and testMT, we took a closer look at their segmentation preferences and the influence on MT. Our first finding is that the segmentation consensus between SMS and UBS is positive for MT. About 35% of the segmentations produced by the two models are identical. If these identical segmentations are removed and the experiments rerun, the translation scores decrease (on average) by 0.50, 0.85 and 0.70 on BLEU, NIST and METEOR, respectively. Our second finding is that SMS exhibits better segmentation consistency than UBS. One representative example is the segmentation of “孤零零 (lonely)”. All the outputs of SMS were “孤零零”, while UBS generated three ambiguous segmentations, “孤(alone)_零零(double zero)”, “孤零(lonely)_零(zero)” and “孤(alone)_零(zero)_零(zero)”. The segmentation consistency of SMS rests on the high-quality treebank data and the robust CRFs tagging model. On the other hand, the advantage of UBS lies in capturing segmentations that match the aligned target words. For example, UBS grouped “国(country)_际(border)_间(between)” into one word, “国际间(international)”, rather than the two words “国际(international)_间(between)” given by SMS, since the three characters are aligned to the single English word “international”. This analysis shows that SMS and UBS have their own merits, and combining the knowledge derived from both segmentations is highly encouraged.

The third observation concerns the strong impact of the bilingual constraints on the segmentation models in the MT task. Using the bilingual constraints is the prime objective of this study. Our first contribution toward this goal is using word boundary distributions to capture the bilingual segmentation supervision. This representation reduces the negative impact of erroneous “chars-to-word” alignments: ambiguous types (those with relatively uniform boundary distributions) caused by alignment errors cannot directly bias the model’s tagging preferences. Furthermore, the word boundary distributions are convenient for forming learning constraints over labelings in various constrained learning approaches; they have been successfully used in three types of constraints in our experiments: the PR penalty (our model), decoding constraints in self-training (STS), and virtual evidences (VES). The second contribution is the use of GP, illustrated by STS-GP-PL, VES-GP-PL and our model. Its major effect is to multiply the impact of the bilingual knowledge through the similarity graph: graph vertices (types) without any supervision can learn word boundary information from their similar neighbors that carry empirical boundary probabilities (this experiment yielded a similarity graph consisting of 11,909,620 types from trainTB and trainMT, of which 8,593,220 (72.15%) types have no empirical boundary distribution). The segmentations given by the three GP models show about 70% positive segmentation changes, attributable to the unlabeled graph vertices, with respect to the ones given by the NO-GP models, STS-NO-GP and VES-NO-GP. In our opinion, the learning mechanism of our approach, a joint coupling of GP and CRFs rather than the pipelined scheme of the other two models, maximizes the graph smoothness effects on the CRFs estimation, so that the error propagation of the pipelined approaches is alleviated.

5 Conclusion

This paper proposed a novel CWS model for the SMT task. The model aims to maintain the linguistic segmentation supervision from treebank data while integrating useful bilingual segmentation knowledge induced from the bitexts. This objective is accomplished in three main steps: 1) learn word boundaries from character-based alignments; 2) encode the learned word boundaries into a GP constraint; and 3) train a CRFs model under the GP constraint using the PR framework. The empirical results indicate that the proposed model yields better segmentations for SMT.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau (Grant No. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS) for the funding support of this research. The work of Isabel Trancoso was supported by national funds through FCT - Fundação para a Ciência e a Tecnologia, under project PEst-OE/EEI/LA0021/2013. The authors also wish to thank the anonymous reviewers for many helpful comments.

References

  • [1] Y. Altun, D. McAllester and M. Belkin(2006) Maximum margin semi-supervised learning for structured variables. Advances in Neural Information Processing Systems 18, pp. 33. Cited by: 3.1.
  • [2] S. Banerjee and A. Lavie(2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. pp. 65–72. Cited by: 4.1.
  • [3] Y. Bengio, O. Delalleau and N. Le Roux(2006) Label propagation and quadratic criterion. Semi-Supervised Learning, pp. 193–216. Cited by: 3.2.
  • [4] P. Chang, M. Galley and C. D. Manning(2008) Optimizing Chinese word segmentation for machine translation performance. pp. 224–232. Cited by: 4.2, 1, 1, 2, 2, 4.4.
  • [5] T. Chung and D. Gildea(2009) Unsupervised tokenization for machine translation. pp. 718–726. Cited by: 3.1.
  • [6] D. Das and S. Petrov(2011) Unsupervised part-of-speech tagging with bilingual graph-based projections.. pp. 600–609. Cited by: 2, 3.2, 3.2.
  • [7] D. Das and N. A. Smith(2012) Graph-based lexicon expansion with sparsity-inducing penalties. pp. 677–687. Cited by: 3.2, 3.2.
  • [8] G. R. Doddington, M. A. Przybocki, A. F. Martin and D. A. Reynolds(2000) The nist speaker recognition evaluation–overview, methodology, systems, results, perspective. Speech Communication 31 (2), pp. 225–254. Cited by: 4.1.
  • [9] K. Ganchev, J. Graça, J. Gillenwater and B. Taskar(2010) Posterior regularization for structured latent variable models. The Journal of Machine Learning Research 11, pp. 2001–2049. Cited by: 1, 2, 3.3, 3.3, 3.
  • [10] L. He, J. Gillenwater and B. Taskar(2013) Graph-based posterior regularization for semi-supervised structured prediction. pp. 38. Cited by: 3.3, 3.3.
  • [11] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran and R. Zens(2007) Moses: open source toolkit for statistical machine translation. pp. 177–180. Cited by: 4.1.
  • [12] Y. Ma and A. Way(2009) Bilingually motivated domain-adapted word segmentation for statistical machine translation. pp. 549–557. Cited by: 4.2, 1, 1, 2, 3.1.
  • [13] G. S. Mann and A. McCallum(2008) Generalized expectation criteria for semi-supervised learning of conditional random fields. pp. 870–878. Cited by: 2.
  • [14] A. McCallum, G. Mann and G. Druck(2007) Generalized expectation criteria. Computer Science Technical Note, University of Massachusetts, Amherst, MA. Cited by: 2.
  • [15] F. J. Och and H. Ney(2003) A systematic comparison of various statistical alignment models. Computational Linguistics 29 (1), pp. 19–51. Cited by: 1, 3.1, 4.1, 4.1.
  • [16] F. J. Och(2003) Minimum error rate training in statistical machine translation. pp. 160–167. Cited by: 4.1.
  • [17] K. Papineni, S. Roukos, T. Ward and W. Zhu(2002) BLEU: a method for automatic evaluation of machine translation. pp. 311–318. Cited by: 4.1.
  • [18] M. Paul, F. Andrew and S. Eiichiro(2011) Integration of multiple bilingually-trained segmentation schemes into statistical machine translation. IEICE Transactions on Information and Systems 94 (3), pp. 690–697. Cited by: 1, 2.
  • [19] A. Stolcke(2002) SRILM-an extensible language modeling toolkit.. Cited by: 4.1.
  • [20] A. Subramanya, S. Petrov and F. Pereira(2010) Efficient graph-based semi-supervised learning of structured tagging models. pp. 167–176. Cited by: 4.2, 2, 3.2, 3.2, 4.1.
  • [21] L. Tian, D. F. Wong, L. S. Chao, P. Quaresma, F. Oliveira, S. Li, Y. Wang and Y. Lu(2014) UM-Corpus: a large English-Chinese parallel corpus for statistical machine translation.. Cited by: 4.1.
  • [22] N. Xi, G. Tang, X. Dai, S. Huang and J. Chen(2012) Enhancing statistical machine translation with character alignment. pp. 285–290. Cited by: 1.
  • [23] J. Xu, E. Matusov, R. Zens and H. Ney(2005) Integrated Chinese word segmentation in statistical machine translation. pp. 141–147. Cited by: 1, 1.
  • [24] J. Xu, R. Zens and H. Ney(2004) Do we need Chinese word segmentation for statistical machine translation?. pp. 122–128. Cited by: 1, 2, 3.1, 4.4.
  • [25] N. Xue, F. Xia, F. Chiou and M. Palmer(2005) The Penn Chinese TreeBank: phrase structure annotation of a large corpus. Natural Language Engineering 11 (2), pp. 207–238. Cited by: 1, 4.1.
  • [26] X. Zeng, D. F. Wong, L. S. Chao, I. Trancoso, L. He and Q. Huang(2014) Lexicon expansion for latent variable grammars. Pattern Recognition Letters 42, pp. 47–55. Cited by: 2.
  • [27] X. Zeng, D. F. Wong, L. S. Chao and I. Trancoso(2013) Graph-based semi-supervised model for joint Chinese word segmentation and part-of-speech tagging. pp. 770–779. Cited by: 4.2, 2, 3.2, 3.2, 4.1.
  • [28] H. Zhang, H. Yu, D. Xiong and Q. Liu(2003) HHMM-based Chinese lexical analyzer ICTCLAS. pp. 184–187. Cited by: 4.2.
  • [29] R. Zhang, K. Yasuda and E. Sumita(2008) Improved statistical machine translation by multiple Chinese word segmentation. pp. 216–223. Cited by: 2.
  • [30] H. Zhao, C. Huang and M. Li(2006) An improved Chinese word segmentation system with conditional random field. Cited by: 4.2, 4.1.
  • [31] H. Zhao, M. Utiyama, E. Sumita and B. Lu(2013) An empirical study on word segmentation for Chinese machine translation. Computational Linguistics and Intelligent Text Processing, pp. 248–263. Cited by: 1, 2.
  • [32] L. Zhu, D. F. Wong and L. S. Chao(2014) Unsupervised chunking based on graph propagation from bilingual corpus. The Scientific World Journal 2014 (401943), pp. 10. Cited by: 2.
  • [33] X. Zhu, Z. Ghahramani and J. Lafferty(2003) Semi-supervised learning using gaussian fields and harmonic functions. Vol. 3, pp. 912–919. Cited by: 3.2.