There are two dominant approaches to Chinese word segmentation: word-based and character-based models, each with respective strengths. Prior work has shown that gains in segmentation performance can be achieved by combining these two types of models; however, past efforts have not provided a practical technique that allows mainstream adoption. We propose a method that effectively combines the strengths of both segmentation schemes using an efficient dual-decomposition algorithm for joint inference. Our method is simple and easy to implement. Experiments on the SIGHAN 2003 and 2005 evaluation datasets show that our method achieves the best reported results to date on 6 out of 7 datasets.
Chinese text is written without delimiters between words; as a result, Chinese word segmentation (CWS) is an essential foundational step for many tasks in Chinese natural language processing. As demonstrated by [15, 2, 3, 10], the quality and consistency of segmentation have important downstream impacts on system performance in machine translation, POS tagging, and parsing.
State-of-the-art performance in CWS is high, with F-scores in the upper 90s. Still, challenges remain. Unknown words, also known as out-of-vocabulary (oov) words, cause difficulties for word- or dictionary-based approaches. Ambiguity can cause errors when the appropriate segmentation can only be determined from context, as with 才能 ("talent") versus 才 / 能 ("just able") [8].
There are two primary classes of models: character-based, where the foundational units for processing are individual Chinese characters [23, 19, 24, 20], and word-based, where the units are full words based on some dictionary or training lexicon [1, 25]. \newciteSun:2010:COLING details their respective theoretical strengths: character-based approaches better model the internal compositional structure of words and are therefore more effective at inducing new oov words; word-based approaches are better at reproducing the words of the training lexicon and can capture information from significantly larger contextual spans. Prior work has shown performance gains from combining these two types of models to exploit their respective strengths, but such approaches are often complex to implement and computationally expensive.
In this work, we propose a simple and principled joint decoding method for combining character-based and word-based segmenters based on dual decomposition. This method has strong optimality guarantees and works very well empirically. It is easy to implement and does not require retraining of existing character- and word-based segmenters. Perhaps most importantly, this work presents a much more practical and usable form of classifier combination in the CWS context than existing methods offer.
Experimental results on the standard SIGHAN 2003 and 2005 bake-off evaluations show that our model outperforms the character and word baselines by a significant margin. In particular, our approach improves oov recall rates and segmentation consistency, and gives the best reported results to date on 6 out of 7 datasets.
In this section, we describe the character-based and word-based models we use as baselines, review existing approaches to combination, and describe our algorithm for joint decoding with dual decomposition.
In the most commonly used contemporary approach to character-based segmentation, first proposed by [23], CWS is treated as a character sequence tagging task: each character is tagged according to whether it is at the beginning, middle, or end of a word. Conditional random fields (CRF) [11] have been widely adopted for this task and give state-of-the-art results [19]. In a first-order linear-chain CRF, the conditional probability of a tag sequence $\mathbf{y}$ given the input character sequence $\mathbf{x}$ is defined as:

$$P(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,\exp\Big(\sum_{t=1}^{|\mathbf{x}|}\sum_{k}\lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big)$$

where the $f_k$ are feature functions that typically include surrounding character n-gram and morphological suffix/prefix features, the $\lambda_k$ are their learned weights, and $Z(\mathbf{x})$ is the partition function. These types of features capture the compositional properties of characters and are therefore likely to generalize well to unknown words. However, the Markov assumption of the CRF limits the context available to such features; it is difficult to capture long-range word features in this model.
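To make the character-based formulation concrete, here is a minimal Python sketch of segmentation as BMES tag decoding with first-order Viterbi. The `toy_emission` function and its weights are stand-ins for the CRF's learned feature potentials (character n-grams, affixes); this is an illustrative sketch, not the Stanford CRF implementation.

```python
# Character-based CWS as BMES tagging with first-order Viterbi decoding.
# A real CRF learns the scores; toy_emission below is a hypothetical stand-in.

TAGS = ["B", "M", "E", "S"]  # Begin / Middle / End of a multi-char word; Single
# Transitions that yield a well-formed BMES sequence.
VALID_NEXT = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}

def viterbi(chars, emission_score):
    """emission_score(chars, i, tag) -> float stands in for the CRF's
    weighted feature sum at position i."""
    n = len(chars)
    delta = [{t: float("-inf") for t in TAGS} for _ in range(n)]
    back = [{t: None for t in TAGS} for _ in range(n)]
    for t in ("B", "S"):                       # a sentence must start a word
        delta[0][t] = emission_score(chars, 0, t)
    for i in range(1, n):
        for t in TAGS:
            for prev in TAGS:
                if t not in VALID_NEXT[prev] or delta[i - 1][prev] == float("-inf"):
                    continue
                s = delta[i - 1][prev] + emission_score(chars, i, t)
                if s > delta[i][t]:
                    delta[i][t], back[i][t] = s, prev
    last = max(("E", "S"), key=lambda t: delta[n - 1][t])  # must end a word
    tags = [last]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))

def tags_to_words(chars, tags):
    words, cur = [], ""
    for c, t in zip(chars, tags):
        cur += c
        if t in ("E", "S"):
            words.append(cur)
            cur = ""
    return words

# Toy scores: favor treating 的 as a single-character word.
def toy_emission(chars, i, tag):
    if tag == "S" and chars[i] == "的":
        return 2.0
    return 0.5 if tag in ("B", "E") else 0.0

sent = list("小明的书")
print(tags_to_words(sent, viterbi(sent, toy_emission)))  # ['小明', '的', '书']
```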
Word-based models search over candidate segmentations using scoring functions that directly assign a score to each candidate. Early word-based segmentation work employed simple heuristics such as dictionary-lookup maximum matching [4]. More recently, \newciteZhang:2007:ACL reported success with a linear model trained with the averaged perceptron algorithm [5]. Formally, given an input sentence $x$, their model seeks a segmentation $F(x)$ such that:

$$F(x) \;=\; \operatorname*{argmax}_{y \,\in\, \operatorname{GEN}(x)} \; \mathbf{w} \cdot \boldsymbol{\phi}(y)$$

where $\operatorname{GEN}(x)$ is the set of possible segmentations of $x$, $\boldsymbol{\phi}(y)$ is the feature vector of a segmentation $y$, and $\mathbf{w}$ is the learned weight vector.
Searching the entire segmentation space is intractable even with a local model, so a beam-search algorithm is used. The decoder consumes one input character at a time and, for each hypothesis currently in the beam, scores two new alternatives: appending the new character to the last word of the hypothesis, or starting a new word at the current position.
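The following Python sketch illustrates this search procedure. The `LEXICON` and `toy_score` function are toy stand-ins for the trained perceptron's weighted features, and the beam size is arbitrary; this is not the decoder of \newciteZhang:2007:ACL itself.

```python
# Word-based segmentation via beam search: each new character either extends
# the last word of a hypothesis or starts a new word. score() stands in for
# the perceptron's w . phi(y); here it is a toy lexicon-based scorer.

def beam_search(chars, score, beam_size=8):
    beams = [[]]                              # each hypothesis is a list of words
    for c in chars:
        candidates = []
        for hyp in beams:
            if hyp:                           # extend the last word with c
                candidates.append(hyp[:-1] + [hyp[-1] + c])
            candidates.append(hyp + [c])      # or start a new word with c
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_size]        # keep the top-scoring hypotheses
    return beams[0]

# Toy scorer: reward in-lexicon words, lightly penalize single characters.
LEXICON = {"中国", "人民", "银行"}
def toy_score(words):
    return sum(2.0 if w in LEXICON else (-0.5 if len(w) == 1 else 0.0)
               for w in words)

print(beam_search(list("中国人民银行"), toy_score))  # ['中国', '人民', '银行']
```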
Algorithm 1: Dual decomposition inference for joint segmentation, with the modified decoders for the two sub-problems.

    ∀ i ∈ {1, …, n}: u(i) ← 0
    for t ← 1 to K do
        y_c ← argmax_y  f(y) + Σ_i u(i) · y(i)        (CRF, via the modified Viterbi below)
        y_w ← argmax_y  g(y) − Σ_i u(i) · y(i)        (perceptron, via the modified beam search below)
        if y_c = y_w then
            return (y_c, y_w)
        end if
        for all i ∈ {1, …, n} do
            u(i) ← u(i) + α_t · (y_w(i) − y_c(i))
        end for
    end for
    return (y_c, y_w)

    Viterbi: standard Viterbi decoding for the CRF, except that at each character position i the score of any tag that ends a word there is additionally credited with u(i).

    Beam-Search: the word-based beam-search decoder described above, except that whenever a hypothesis ends a word at character position i its score is additionally debited by u(i).
Table 1: Results on the SIGHAN 2005 test sets. R_oov is out-of-vocabulary recall; C is the segmentation consistency score of [3] (lower is better).

| | Academia Sinica | | | | | Peking Univ. | | | | |
| | R | P | F | R_oov | C | R | P | F | R_oov | C |
| Char-based CRF | 95.2 | 93.6 | 94.4 | 58.9 | 0.064 | 94.6 | 95.3 | 94.9 | 77.8 | 0.089 |
| Word-based Perceptron | 95.8 | 95.0 | 95.4 | 69.5 | 0.060 | 94.1 | 95.5 | 94.8 | 76.7 | 0.099 |
| Dual-decomp | 95.9 | 94.9 | 95.4 | 67.7 | 0.055 | 94.8 | 95.7 | 95.3 | 78.7 | 0.086 |

| | City Univ. of Hong Kong | | | | | Microsoft Research | | | | |
| | R | P | F | R_oov | C | R | P | F | R_oov | C |
| Char-based CRF | 94.7 | 94.0 | 94.3 | 76.1 | 0.065 | 96.4 | 96.6 | 96.5 | 71.3 | 0.074 |
| Word-based Perceptron | 94.3 | 94.0 | 94.2 | 71.7 | 0.073 | 97.0 | 97.2 | 97.1 | 74.6 | 0.063 |
| Dual-decomp | 95.0 | 94.4 | 94.7 | 75.3 | 0.062 | 97.3 | 97.4 | 97.4 | 76.0 | 0.055 |
Various mixing approaches have been proposed to combine the above two approaches [22, 12, 18, 17, 20]. These mixing models perform well on standard datasets, but are not in wide use because of their high computational costs and difficulty of implementation.
Dual decomposition (DD) [14] offers an attractive framework for combining these two types of models without incurring high costs in model complexity (in contrast to [18]) or decoding efficiency (in contrast to bagging in [22, 17]). DD has been successfully applied to similar situations for combining local with global models; for example, in dependency parsing [9], bilingual sequence tagging [21] and word alignment [6].
The idea is that jointly modelling both character-sequence and word information can be computationally challenging, so instead we can try to find outputs that the two models are most likely to agree on. Formally, the objective of DD is:
$$\max_{\mathbf{y}_c,\,\mathbf{y}_w} \; f(\mathbf{y}_c) + g(\mathbf{y}_w) \qquad \text{s.t.} \;\; \mathbf{y}_c = \mathbf{y}_w \tag{1}$$

where $\mathbf{y}_c$ is the output of the character-based CRF with model score $f(\mathbf{y}_c)$, $\mathbf{y}_w$ is the output of the word-based perceptron with model score $g(\mathbf{y}_w)$, and agreement between the two is expressed as a constraint ("s.t." is shorthand for "such that").

Solving this constrained optimization problem directly is difficult. Instead, we take the Lagrangian relaxation of this term as:

$$L(\mathbf{y}_c, \mathbf{y}_w; \mathbf{u}) \;=\; f(\mathbf{y}_c) + g(\mathbf{y}_w) + \sum_{i} u(i)\,\big(\mathbf{y}_c(i) - \mathbf{y}_w(i)\big) \tag{2}$$

where $\mathbf{u} = \{u(1), \dots, u(n)\}$ is the set of Lagrange multipliers, with one multiplier $u(i)$ at each word position $i$.

We can rewrite the original objective with the Lagrangian relaxation as:

$$\max_{\mathbf{y}_c,\,\mathbf{y}_w} \; \min_{\mathbf{u}} \; L(\mathbf{y}_c, \mathbf{y}_w; \mathbf{u}) \tag{3}$$

We can then form the dual of this problem by taking the $\min$ outside of the $\max$, which yields an upper bound on the original problem. The dual form can then be decomposed into two sub-problems (the two $\max$ terms in Eq. 4), each of which is local with respect to the set of Lagrange multipliers:

$$\min_{\mathbf{u}} \Big( \max_{\mathbf{y}_c} \big[ f(\mathbf{y}_c) + \textstyle\sum_i u(i)\,\mathbf{y}_c(i) \big] \;+\; \max_{\mathbf{y}_w} \big[ g(\mathbf{y}_w) - \textstyle\sum_i u(i)\,\mathbf{y}_w(i) \big] \Big) \tag{4}$$
This method is called dual decomposition (DD) [14]. Similar to previous work [13], we solve this DD problem by iteratively updating the sub-gradient, as depicted in Algorithm 1 (see \newciteRush:2012:JAIR for a full introduction to DD). In each iteration, if the best segmentations proposed by the two models do not agree, then each model receives a penalty for the decisions it made that differ from the other model. This penalty exchange is similar to message passing, and as the penalties accumulate over iterations, the two models are pushed towards agreement. Algorithm 1 also gives the modified Viterbi decoding algorithm for the CRF and the modified beam-search algorithm for the perceptron. $K$ is the maximum number of iterations before early stopping, and $\alpha_t$ is the learning rate at iteration $t$. We adopt the learning rate update rule from \newciteKoo:2010:EMNLP, setting $\alpha_t = \frac{\alpha_0}{1 + N_t}$, where $\alpha_0$ is the initial rate and $N_t$ is the number of times we observed a consecutive dual-value increase from iteration 1 to $t$.
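As a concrete illustration of this loop, the following Python sketch implements the subgradient updates under a word-boundary-vector representation of the outputs. The decoder interfaces `crf_decode` and `pcpt_decode` are hypothetical stand-ins for the penalty-augmented Viterbi and beam-search routines of Algorithm 1, the learning-rate rule is written in the assumed form shown above, and the fallback on non-convergence is a choice made for this sketch; it is not the authors' released code.

```python
# Minimal sketch of dual-decomposition joint decoding.
# crf_decode(chars, u) / pcpt_decode(chars, u) are assumed to return
# (y, score): y[i] = 1 if a word ends at character i, and score is the
# penalty-augmented objective of the corresponding sub-problem in Eq. 4.

def dual_decompose(chars, crf_decode, pcpt_decode, max_iter=100, alpha0=0.1):
    n = len(chars)
    u = [0.0] * n                        # one Lagrange multiplier per position
    increases = 0                        # dual-value increases observed so far
    prev_dual = None
    y_c = y_w = None
    for _ in range(max_iter):
        y_c, score_c = crf_decode(chars, u)    # argmax f(y) + sum_i u[i] * y[i]
        y_w, score_w = pcpt_decode(chars, u)   # argmax g(y) - sum_i u[i] * y[i]
        if y_c == y_w:
            return y_c                   # agreement: exact joint solution
        dual = score_c + score_w
        if prev_dual is not None and dual > prev_dual:
            increases += 1               # feeds the assumed rate rule below
        prev_dual = dual
        alpha = alpha0 / (1.0 + increases)
        for i in range(n):               # subgradient step on the multipliers
            u[i] -= alpha * (y_c[i] - y_w[i])
    return y_w                           # no agreement: fall back (sketch's choice)
```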
We conduct experiments on the SIGHAN 2003 [16] and 2005 [7] bake-off datasets to evaluate the effectiveness of the proposed dual decomposition algorithm. We use the publicly available Stanford CRF segmenter [19] (http://nlp.stanford.edu/software/segmenter.shtml) as our character-based baseline model, and reproduce the perceptron-based segmenter from \newciteZhang:2007:ACL as our word-based baseline model.
We adopted the development setting of [25], using CTB sections 1–270 for training and sections 400–931 for development when tuning hyper-parameters; for all results given in the tables, the models are trained and evaluated on the standard train/test split of the given dataset. The hyper-parameters, tuned on the development set, are: the CRF regularization parameter; 10 training iterations for the perceptron with a beam size of 200; and a maximum of 100 dual decomposition iterations ($K$ in Algorithm 1) with step size $\alpha_0 = 0.1$.
Beyond standard precision (P), recall (R), and F scores, we also evaluate segmentation consistency as proposed by [3], who showed that increased segmentation consistency is correlated with better machine translation performance. The consistency measure calculates the entropy of segmentation variations; the lower the score, the better. We also report out-of-vocabulary recall (R_oov) as an estimate of the model's ability to generalize to previously unseen words.
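For illustration, the sketch below shows one way such an entropy-based consistency score can be computed: for each gold word type, it measures the entropy of the system's segmentation variants over that word's occurrences and averages with occurrence weights. The exact definition and weighting used by [3] may differ; this is illustrative only.

```python
# Sketch of an occurrence-weighted entropy of segmentation variants
# (lower = more consistent). Uses natural log; [3] may differ in detail.

import math
from collections import Counter, defaultdict

def consistency_entropy(gold_tokens, sys_variants):
    """gold_tokens: gold word tokens, one entry per occurrence.
    sys_variants: for each gold token, the system's segmentation of that span,
    e.g. "才能" kept whole vs. split as "才 能"."""
    variants = defaultdict(Counter)
    for w, v in zip(gold_tokens, sys_variants):
        variants[w][v] += 1
    total = sum(sum(c.values()) for c in variants.values())
    if total == 0:
        return 0.0
    score = 0.0
    for counts in variants.values():
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        score += (n / total) * entropy    # weight by number of occurrences
    return score
```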
Table 1 shows our empirical results on the SIGHAN 2005 datasets. Our dual decomposition method outperforms both the word-based and character-based baselines consistently across all four subsets in both F score and oov recall (R_oov). Our method is robust across domains and segmentation standards, regardless of which baseline model was stronger. Of particular note, DD is much more robust in R_oov, where the two baselines fluctuate considerably; this is an important property for downstream applications such as entity recognition. The DD algorithm also gives more consistent segmentations, which would likely lead to improvements in applications such as machine translation [3].
The improvement over our word- and character-based baselines is also seen in our results on the earlier SIGHAN 2003 dataset. Table 2 puts our method in the context of earlier systems for CWS. Our method achieves the best reported score on 6 out of 7 datasets.
Table 2: F scores of our method compared with earlier reported results on the SIGHAN 2005 and 2003 test sets ("-" indicates no reported result).

| | AS | PU | CU | MSR |
|---|---|---|---|---|
| SIGHAN 2005 | | | | |
| Best 05 | 95.2 | 95.0 | 94.3 | 96.4 |
| Zhang et al. 06 | 94.7 | 94.5 | 94.6 | 96.4 |
| Z&C 07 | 94.6 | 94.5 | 95.1 | 97.2 |
| Sun et al. 09 | - | 95.2 | 94.6 | 97.3 |
| Sun 10 | 95.2 | 95.2 | 95.6 | 96.9 |
| Dual-decomp | 95.4 | 95.3 | 94.7 | 97.4 |
| SIGHAN 2003 | | | | |
| Best 03 | 96.1 | 95.1 | 94.0 | - |
| Peng et al. 04 | 95.6 | 94.1 | 92.8 | - |
| Z&C 07 | 96.5 | 94.0 | 94.6 | - |
| Dual-decomp | 97.1 | 95.4 | 94.9 | - |
On the whole, dual decomposition produces state-of-the-art segmentations that are more accurate, more consistent, and more successful at inducing oov words than the baseline systems it combines. On the SIGHAN 2005 test set, the DD algorithm converged within 100 iterations in over 99.1% of cases, which gives an optimality guarantee; in 77.4% of cases, DD converged in the first iteration. A histogram of the number of iterations to convergence is shown in Figure 1.
In many cases the relative confidence of each model means that dual decomposition is capable of using information from both sources to generate a series of correct segmentations better than either baseline model alone. The example below shows a difficult-to-segment proper name comprised of common characters, which results in undersegmentation by the character-based CRF and oversegmentation by the word-based perceptron, but our method achieves the correct middle ground.
| Gloss | Tian Yage / 's / creations |
| Gold | 田雅各 / 的 / 创作 |
| CRF | 田雅各的 / 创作 |
| PCPT | 田雅 / 各 / 的 / 创作 |
| DD | 田雅各 / 的 / 创作 |
A powerful feature of the dual decomposition approach is that it can generate correct segmentation decisions in cases where a voting or product-of-experts model could not, since joint decoding allows the sharing of information at decoding time. In the following example, both baseline models miss the contextually clear use of the word 点心 ("sweets / snack food") and instead attach 点 to the prior word to produce the otherwise common compound 一点点 ("a little bit"); dual decomposition allows the model to generate the correct segmentation.
| Gloss | Enjoy / a bit of / snack food / , … |
| Gold | 享受 / 一点 / 点心 / ， |
| CRF | 享受 / 一点点 / 心 / ， |
| PCPT | 享受 / 一点点 / 心 / ， |
| DD | 享受 / 一点 / 点心 / ， |

We found more than 400 such surprisingly accurate instances in our dual decomposition output.
Finally, since dual decomposition is a method of joint decoding, it is still liable to reproduce errors made by the constituent systems.
In this paper we presented an approach to Chinese word segmentation using dual decomposition for system combination. We demonstrated that this method allows for joint decoding of existing CWS systems that is more accurate and consistent than either system alone, and further achieves the best performance reported to date on standard datasets for the task. Perhaps most importantly, our approach is straightforward to implement and does not require retraining of the underlying segmentation models used. This suggests its potential for broader applicability in real-world settings than existing approaches to combining character-based and word-based models for Chinese word segmentation.