Chinese language processing relies heavily on an efficient input method engine (IME), of which pinyin-to-Chinese (PTC) conversion is the core part. Meanwhile, although typos are inevitable during pinyin input, existing IMEs have paid little attention to this inconvenience. In this paper, motivated by a key equivalence between two decoding algorithms, we propose a joint graph model that globally optimizes PTC conversion and typo correction for IME. The evaluation results show that the proposed method outperforms both existing academic and commercial IMEs.
The daily life of Chinese people heavily depends on the Chinese input method engine (IME), no matter whether one is composing an e-mail, writing an article, or sending a text message. However, Chinese words cannot be typed into a computer or cellphone through a direct one-to-one key-to-letter mapping; they have to go through an IME, as there are thousands of Chinese characters to input while only 26 letter keys are available on the keyboard. An IME is an essential software interface that maps combinations of English letters into Chinese characters. An efficient IME largely improves the user experience of Chinese information processing.
Nowadays most Chinese IMEs are pinyin based. Pinyin was originally designed as the phonetic notation of Chinese characters (based on standard modern Chinese, Mandarin), using Latin letters for its syllables. For example, the pinyin of the Chinese character “爱” (love) is “ài”. Most characters have a unique pinyin representation, while a few Chinese characters may be pronounced in several different ways and thus have multiple pinyin representations. The advantage of a pinyin IME is that it relies only on the pronunciation of Chinese characters, so it is simple and easy to learn. But there are fewer than 500 pinyin syllables in standard modern Chinese, compared with over 6,000 commonly used Chinese characters, which leads to serious ambiguities in pinyin-to-character mapping. Modern pinyin IMEs mostly use a “sentence-based” decoding technique [4] to alleviate these ambiguities: the IME generates a sequence of Chinese characters from a sequence of pinyin inputs according to certain statistical criteria.
Since Chinese is written in characters rather than an alphabet, spell checking for Chinese is quite different from the same task for other languages. Because Chinese characters are entered via an IME, user-made typos do not immediately lead to spelling errors. When a user types a wrong letter, the IME will very likely fail to generate the expected Chinese character sequence. Normally, the user immediately notices the inputting error and then makes corrections, which usually means a bunch of extra operations such as cursor movement, deletion and re-typing. Thus there are two separate sub-tasks for Chinese spell checking: 1. typo checking for user-typed pinyin sequences, which should be a built-in module of the IME, and 2. spell checking for Chinese texts in its narrow sense, which is typically a module of word processing applications [36]. These two terms are often confused, especially in IME-related works such as [4] and [34].
Pinyin typos have always been a serious problem for Chinese pinyin IMEs. A user may fail to input the exactly right pinyin simply because he/she is a dialect speaker and does not know the exact pronunciation of the expected character. This is a rather common situation, since there are about seven quite different dialect groups in Chinese, six of which, as spoken languages, are far from the standard modern Chinese, Mandarin. With the boom of smartphones, pinyin typos worsen due to the limited size of soft keyboards and the lack of physical feedback on the touch screen. However, existing practical IMEs only provide small patches to deal with typos, such as Fuzzy Pinyin [33] and other language-specific errors [45].
Typo checking and correction has an important impact on IME performance. When an IME fails to correct a typo and generate the expected sentence, the user has to take extra effort to move the cursor back to the mistyped letter and correct it, which leads to a very poor user experience [15].
The very first approach to Chinese input with typo correction was made by [4], which was also the initial attempt at a “sentence-based” IME. They proposed the idea of a “statistical input method” by modeling PTC conversion as a hidden Markov model (HMM) and decoding the sequence with the Viterbi algorithm [26]. They solved the typo correction problem by decomposing the conditional probability of a Chinese character sequence given a pinyin sequence into a language model and a typing model. The typing model, estimated on real user input data, was responsible for typo correction. However, real user input data can be very noisy and is not convenient to obtain. As we propose a joint model in this paper, such an individual typing model is not necessary in our approach.
[44] developed an IME system with typo correction called CHIME, using a noisy channel error model and language-specific features. However, their model relies on a very strong assumption that the input pinyin sequence has already been segmented into pinyin words by the user. This assumption does not really hold in modern “sentence-based” IMEs. We relax this assumption since our model solves segmentation, typo correction and PTC conversion jointly.
Besides the common HMM approach to PTC conversion, various other methods have been explored, such as the support vector machine [16], maximum entropy (ME) model [30], conditional random field (CRF) [19] and statistical machine translation (SMT) [35, 29, 38].
Spell checking, or typo checking, was first proposed for English [24]. [23] argued that spell checking should be done within a context, i.e., a sentence or a long phrase with a certain meaning, instead of within one word only. A recent spell correction work is [20], which introduced a distributional similarity measure for spell correction of web queries.
Early attempts at Chinese spell checking date back to [2], where character tables for characters similar in shape, pronunciation, meaning, and input method code were proposed. More recently, the 7th SIGHAN Workshop on Chinese Language Processing [37] held a shared task on Chinese spell checking. Various approaches were proposed for the task, including language model (LM) based methods [3], the ME model [10], CRF [31, 27], SMT [5, 21], and graph models [14].
It is a rather long journey from the first letter typed on the keyboard to a complete Chinese sentence finally generated by the IME. We first take an overview of the entire process.
The average length of a pinyin syllable is about 3 letters, and about 410 pinyin syllables are used in the current pinyin system. Each pinyin syllable corresponds to a set of Chinese characters that share the pronunciation represented by the syllable; the number of such homophones ranges from 1 to over 300. Chinese characters then form words. But word is a rather vague concept in Chinese. Without word delimiters, linguists have long argued over what a Chinese word really is, which is why there is almost always a word segmentation step in Chinese language processing tasks [40, 13, 41, 39, 42, 43]. A Chinese word may contain from 1 to over 10 characters, depending on the word segmentation convention. Figure 1 demonstrates the relationship between pinyin and word, from the pinyin letters “nihao” to the word “你好 (hello)”. Typically, an IME takes the pinyin input, segments it into syllables, looks up corresponding words in a dictionary and generates a sentence from the candidate words.
Non-Chinese users may feel confused or even surprised to learn that when typing pinyin through an IME, Chinese users never enter delimiters such as the “Space” key to separate pinyin syllables or pinyin words; they just input the entire unsegmented pinyin sequence. For example, to input “你好世界 (Hello world)”, one simply types “nihaoshijie” instead of the segmented pinyin sequence “ni hao shi jie”. Nevertheless, pinyin syllable segmentation is a much easier problem than Chinese word segmentation. Since pinyin syllables have a very limited vocabulary and strictly follow a set of regularities, it is convenient to perform pinyin syllable segmentation with rules. But because the pinyin input is not segmented, it is nearly impossible to adopt previous spell checking methods for English for pinyin typo checking, even though techniques for English spell checking are well developed. Somewhat confusingly but interestingly, pinyin typo correction and segmentation are two sides of one problem: when a pinyin sequence is mistyped, it is unlikely to be correctly segmented; when it is segmented in an awkward way, it is likely to be mistyped.
Inspired by [36] and [14], we adopt a graph model, originally used for Chinese spell checking, for pinyin segmentation and typo correction; it is based on the shortest-path word segmentation algorithm [1]. The model has two major steps: segmentation and correction.
The shortest-path segmentation algorithm is based on the idea that a reasonable segmentation should minimize the number of segmented units. For a pinyin sequence $L = l_1 l_2 \dots l_n$, where $l_i$ is a letter, a directed acyclic graph (DAG) $G_s = (V_s, E_s)$ is first built for the pinyin segmentation step. The vertex set $V_s$ consists of the following parts:
Virtual start vertex $v_s$ and end vertex $v_e$;
Possible legal syllables fetched from the syllable dictionary $\mathcal{D}$ according to the input pinyin sequence: $\{\, l_i l_{i+1} \cdots l_j \mid l_i l_{i+1} \cdots l_j \in \mathcal{D} \,\}$;
The letter itself as a fallback, no matter whether it is a legal pinyin syllable or not: $\{\, l_i \mid 1 \le i \le n \,\}$.
The vertex weights are all set to 0. The edges in $E_s$ go from a syllable to every syllable that immediately follows it, i.e., $E_s = \{\, (l_i \cdots l_j) \to (l_{j+1} \cdots l_k) \,\}$.
The edge weight is the negative logarithm of the conditional probability that syllable $s$ is followed by $s'$, which is given by a bigram language model of pinyin syllables: $w(s \to s') = -\log P(s' \mid s)$.
The shortest path on the graph is the path with the least sum of weights: $\hat{p} = \arg\min_{p \in \mathrm{paths}(v_s, v_e)} \sum_{e \in p} w(e)$.
Computing the shortest path from $v_s$ to $v_e$ on $G_s$ yields the best segmentation. This is the single-source shortest path (SSSP) problem on a DAG, which has an efficient algorithm: preprocess the DAG with a topological sort, then traverse vertices and edges in topological order. Its time complexity is $O(|V_s| + |E_s|)$. For example, suppose one intends to input “你好世界 (Hello world)” by typing “nihaoshijie”, but mistypes it as “mihaoshijiw”. The graph for this input is shown in Figure 2. The shortest path, i.e., the best segmentation, is “mi hao shi ji w”. We will continue to use this example in the rest of this paper.
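The following Python sketch illustrates this segmentation step under simplifying assumptions: the syllable inventory is a toy subset, and the bigram model is replaced by a uniform placeholder so that the shortest path simply favors the fewest units; it is not the paper's implementation.

```python
import math

# Toy syllable inventory; a real system would load the full pinyin table
# (~410 syllables) and an estimated syllable bigram LM.
SYLLABLES = {"ni", "hao", "shi", "jie", "mi", "ji"}

def neg_log_bigram(prev, cur):
    """-log P(cur | prev); uniform placeholder so the path with the fewest units wins."""
    return 1.0

def segment(pinyin, max_len=6):
    n = len(pinyin)
    # Vertices: letter spans (i, j) that form a legal syllable, plus the
    # single-letter fallback (i, i+1). Sorting by end position gives a
    # topological order of the DAG.
    spans = [(i, i + 1) for i in range(n)]
    spans += [(i, j) for i in range(n)
              for j in range(i + 2, min(n, i + max_len) + 1)
              if pinyin[i:j] in SYLLABLES]
    spans.sort(key=lambda s: (s[1], s[0]))
    cost, back = {}, {}
    for (i, j) in spans:                       # traverse in topological order
        unit = pinyin[i:j]
        if i == 0:                             # edge from the virtual start vertex
            cost[(i, j)], back[(i, j)] = neg_log_bigram("<s>", unit), None
            continue
        # Relax edges from every predecessor span ending exactly at position i.
        best_c, best_p = math.inf, None
        for (p, q) in spans:
            if q == i:
                c = cost[(p, q)] + neg_log_bigram(pinyin[p:q], unit)
                if c < best_c:
                    best_c, best_p = c, (p, q)
        cost[(i, j)], back[(i, j)] = best_c, best_p
    # The virtual end vertex is reached through the cheapest span ending at position n.
    span = min((s for s in spans if s[1] == n), key=lambda s: cost[s])
    out = []
    while span is not None:
        out.append(pinyin[span[0]:span[1]])
        span = back[span]
    return list(reversed(out))

print(segment("mihaoshijiw"))   # -> ['mi', 'hao', 'shi', 'ji', 'w']
```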
Next, in the correction step, for the segmented pinyin sequence $S = s_1 s_2 \dots s_m$, a graph $G_c = (V_c, E_c)$ is constructed to perform typo correction. The vertex set $V_c$ consists of the following parts:
Virtual start vertex $v_s$ and end vertex $v_e$ with vertex weights of 0;
All legal syllables similar to the original syllable $s_i$ in $S$; if adjacent syllables can be merged into a legal syllable, the merged syllable is also added into $V_c$. Similarity is measured by the Levenshtein distance [18]: syllables whose Levenshtein distance to $s_i$ is under a certain threshold $\theta$ are considered similar, i.e., $\{\, s \in \mathcal{D} \mid d_{\mathrm{Lev}}(s, s_i) \le \theta \,\}$.
The vertex weight is the Levenshtein distance multiplied by a normalization parameter $\alpha$: $w(v) = \alpha \cdot d_{\mathrm{Lev}}(s, s_i)$.
Similar to $G_s$, the edges go from one syllable to all syllables next to it, and the edge weights are the negative log conditional probabilities between them. Computing the shortest path from $v_s$ to $v_e$ on $G_c$ yields the best typo correction result; note that this result is also already segmented. For our running example, the graph is shown in Figure 3, and the typo correction result is “mi hao shi jie”.
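A minimal sketch of the candidate-generation part of this step is given below; the syllable list, the normalization weight ALPHA and the distance threshold are toy assumptions, not the values used in our experiments.

```python
# Generate similar-syllable vertices for the correction graph: each legal
# syllable within a small Levenshtein distance of the typed syllable becomes a
# candidate vertex, weighted by ALPHA times the distance.
SYLLABLES = ["ni", "mi", "hao", "shi", "ji", "jie", "jin"]
ALPHA = 1.0          # normalization parameter for the Levenshtein vertex weight
THRESHOLD = 1        # maximum edit distance considered "similar" (toy setting)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similar_syllables(syllable):
    """All legal syllables within THRESHOLD edits, with their vertex weights."""
    cands = []
    for s in SYLLABLES:
        d = levenshtein(syllable, s)
        if d <= THRESHOLD:
            cands.append((s, ALPHA * d))
    return cands

for syl in ["mi", "hao", "shi", "ji", "w"]:
    print(syl, "->", similar_syllables(syl))
# "mi" yields both "mi" (distance 0) and "ni" (distance 1), so the graph can
# still recover the intended "ni" when the language model favors it.
```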
Using the above model alone, the typo correction result is not satisfying, no matter how much effort is spent. The major reason is that the basic semantic unit of Chinese is actually the word (though vaguely defined), which is usually composed of several characters; the conditional probability between single characters therefore does not make much sense. In addition, a pinyin syllable usually maps to dozens or even hundreds of homophonic characters, which makes the conditional probability between syllables even noisier. However, using pinyin words instead of syllables is not a wise choice either, because pinyin word segmentation is not as easy a task as syllable segmentation. To make typo correction better, we integrate it with PTC conversion in a joint model.
PTC conversion has long been viewed as a decoding problem on an HMM, and we follow this formalization. The best Chinese character sequence $\hat{W}$ for a given pinyin syllable sequence $S$ is the one with the highest conditional probability: $\hat{W} = \arg\max_{W} P(W \mid S)$.
In the HMM for a pinyin IME, the observation states are pinyin syllables, the hidden states are Chinese words, the emission probability is $P(s_i \mid w_i)$, and the transition probability is $P(w_i \mid w_{i-1})$. Note that the transition probability is the conditional probability between words instead of characters. PTC conversion decodes the Chinese word sequence from the pinyin sequence, and the Viterbi algorithm [26] is used for the decoding.
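As a concrete illustration of this HMM view, the toy Viterbi decoder below converts pre-segmented pinyin words into Chinese words; all dictionary entries and probabilities are made-up illustrative numbers, not estimates from our corpus.

```python
import math

# Toy emission probabilities P(pinyin | word) and word-bigram transitions;
# "<s>" is the start state. Numbers are illustrative only.
emission = {
    ("你好", "nihao"): 1.0,
    ("世界", "shijie"): 1.0,
    ("视界", "shijie"): 1.0,
}
transition = {
    ("<s>", "你好"): 0.5,
    ("你好", "世界"): 0.6,
    ("你好", "视界"): 0.05,
}
candidates = {"nihao": ["你好"], "shijie": ["世界", "视界"]}

def viterbi(pinyin_words):
    # paths maps a hidden word to (negative log cost, best word sequence ending there)
    paths = {"<s>": (0.0, [])}
    for py in pinyin_words:
        new_paths = {}
        for w in candidates[py]:
            best = None
            for prev, (cost, seq) in paths.items():
                p_t = transition.get((prev, w), 1e-6)   # unseen bigrams get a tiny floor
                p_e = emission.get((w, py), 1e-6)
                c = cost - math.log(p_t) - math.log(p_e)
                if best is None or c < best[0]:
                    best = (c, seq + [w])
            new_paths[w] = best
        paths = new_paths
    return min(paths.values())[1]

print(viterbi(["nihao", "shijie"]))   # -> ['你好', '世界']
```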
The shortest-path algorithm for typo correction and the Viterbi algorithm for PTC conversion are closely related. It has been strictly proven by [8] that the sequence decoding problem on an HMM is formally identical to finding a shortest path on a certain graph, which can be constructed in the following manner.
Consider a first-order HMM with observation set $O$, hidden state set $Q$, a special start state $q_0$, emission probabilities $P(o \mid q)$, transition probabilities $P(q' \mid q)$, and start probabilities $\pi(q) = P(q \mid q_0)$. For an observation sequence $o_1 o_2 \dots o_T$ of $T$ time periods, the decoding problem is to find the corresponding hidden state sequence with the highest probability, i.e.,
$$\hat{q}_{1:T} = \arg\max_{q_{1:T}} \; \pi(q_1)\, P(o_1 \mid q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})\, P(o_t \mid q_t). \qquad (1)$$
Then we construct a DAG $G_h = (V_h, E_h)$ upon the HMM. The vertex set $V_h$ includes:
Virtual start vertex $v_s$ and end vertex $v_e$ with vertex weights of 0;
Normal vertices $v_{t,q}$ for every time period $t$ and hidden state $q$, where $1 \le t \le T$ and $q \in Q$. The vertex weight is the negative logarithm of the emission probability: $w(v_{t,q}) = -\log P(o_t \mid q)$.
The edge set $E_h$ includes:
Edges from the start vertex to the first time period, $v_s \to v_{1,q}$, with edge weight $w(v_s \to v_{1,q}) = -\log \pi(q)$, where $q \in Q$;
Edges from the last time period to the end vertex, $v_{T,q} \to v_e$, with edge weights of 0;
Edges between adjacent time periods, $v_{t,q} \to v_{t+1,q'}$, with edge weight $w(v_{t,q} \to v_{t+1,q'}) = -\log P(q' \mid q)$, where $1 \le t < T$ and $q, q' \in Q$.
The shortest path from $v_s$ to $v_e$ is the one with the least sum of vertex and edge weights, i.e.,
$$\hat{p} = \arg\min_{p \in \mathrm{paths}(v_s, v_e)} \left( \sum_{v \in p} w(v) + \sum_{e \in p} w(e) \right). \qquad (2)$$
Since all weights are negative log probabilities, the sum in Eq. (2) is exactly the negative logarithm of the product in Eq. (1); minimizing one is therefore equivalent to maximizing the other, and the normal vertices on the shortest path give the best hidden state sequence.
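The short numeric check below makes this equivalence concrete on a toy two-state HMM with made-up probabilities: the DAG is built with negative log weights as in the construction above, and its shortest path is compared against a brute-force maximization of Eq. (1).

```python
import math
import itertools

# Toy first-order HMM; the numbers are illustrative only.
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {("A", "A"): 0.7, ("A", "B"): 0.3, ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.9, ("A", "y"): 0.1, ("B", "x"): 0.2, ("B", "y"): 0.8}
obs = ["x", "y", "y"]
T = len(obs)

# --- Shortest path on the DAG built as in the construction above ------------
# Vertex (t, s) carries -log P(obs[t] | s); edges carry -log transition
# (or -log start probability from the virtual start vertex).
cost, back = {}, {}
for s in states:                                    # edges from the start vertex
    cost[(0, s)] = -math.log(start[s]) - math.log(emit[(s, obs[0])])
    back[(0, s)] = None
for t in range(1, T):                               # vertices in topological order
    for s in states:
        best_prev = min(states,
                        key=lambda p: cost[(t - 1, p)] - math.log(trans[(p, s)]))
        cost[(t, s)] = (cost[(t - 1, best_prev)]
                        - math.log(trans[(best_prev, s)])
                        - math.log(emit[(s, obs[t])]))
        back[(t, s)] = best_prev
last = min(states, key=lambda s: cost[(T - 1, s)])  # edge to the end vertex has weight 0
path, t = [last], T - 1
while back[(t, path[-1])] is not None:
    path.append(back[(t, path[-1])])
    t -= 1
shortest_path_states = list(reversed(path))

# --- Brute-force maximization of Eq. (1) -------------------------------------
def prob(seq):
    p = start[seq[0]] * emit[(seq[0], obs[0])]
    for t in range(1, T):
        p *= trans[(seq[t - 1], seq[t])] * emit[(seq[t], obs[t])]
    return p

brute_force_states = max(itertools.product(states, repeat=T), key=prob)

print(shortest_path_states, list(brute_force_states))
assert shortest_path_states == list(brute_force_states)
```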
Given that the HMM decoding problem is identical to the SSSP problem on a DAG, we propose a joint graph model for PTC conversion with typo correction. The joint graph model aims to find the global optimum of both PTC conversion and typo correction over the entire input pinyin sequence. The joint graph $G_j = (V_j, E_j)$ is constructed on top of the typo correction graph $G_c$ of Section 3.2. The vertex set $V_j$ consists of the following parts:
Virtual start vertex $v_s$ and end vertex $v_e$ with vertex weights of 0;
Adjacent pinyin syllables in $G_c$ are merged into pinyin words, and the corresponding Chinese words are fetched from a PTC dictionary, i.e., a dictionary that maps pinyin words to Chinese words, and added as vertices;
The vertex weight consists of two parts: 1. the vertex weights of the merged syllables in $G_c$, and 2. the negative logarithm of the emission probability: $w(v) = \sum_{s \in v} w_{G_c}(s) - \log P(\mathrm{py}(v) \mid w_v)$, where $\mathrm{py}(v)$ denotes the merged pinyin word and $w_v$ the Chinese word of vertex $v$.
If the corresponding pinyin syllables in $G_c$ have an edge between them, the corresponding word vertices in $G_j$ also have an edge;
The edge weights are the negative logarithms of the transition probabilities between the Chinese words: $w(v \to v') = -\log P(w_{v'} \mid w_v)$.
Although the model is formulated on a first-order HMM, i.e., the LM used for the transition probability is a bigram model, it is easy to extend the model to take advantage of a higher-order $n$-gram LM by tracking a longer history while traversing the graph.
Computing the shortest path from $v_s$ to $v_e$ on $G_j$ yields the best pinyin-to-Chinese conversion with typo correction. For our running example, the graph is shown in Figure 4.
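The sketch below shows how the vertices of the joint graph can be generated from the corrected syllable lattice; the PTC dictionary entries and emission probabilities are toy, illustrative assumptions rather than values from our actual dictionary.

```python
import math

# Toy PTC dictionary: pinyin word -> [(Chinese word, emission probability)].
PTC_DICT = {
    "ni": [("你", 0.9), ("尼", 0.1)],
    "hao": [("好", 1.0)],
    "nihao": [("你好", 1.0)],
    "shijie": [("世界", 0.7), ("视界", 0.3)],
}

def joint_vertices(syllables, syl_weights, max_word_syls=3):
    """syllables: corrected syllable sequence; syl_weights: their G_c vertex weights.

    Returns joint-graph vertices as (start_idx, end_idx, chinese_word, weight),
    where the weight sums the covered syllable weights and -log emission."""
    vertices = []
    n = len(syllables)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_word_syls) + 1):
            pinyin_word = "".join(syllables[i:j])       # merge adjacent syllables
            for word, emit_p in PTC_DICT.get(pinyin_word, []):
                w = sum(syl_weights[i:j]) - math.log(emit_p)
                vertices.append((i, j, word, w))
    return vertices

# Corrected syllables of "mihaoshijiw" -> "ni hao shi jie"; the first syllable
# carries a Levenshtein vertex weight of 1 (mi -> ni).
for v in joint_vertices(["ni", "hao", "shi", "jie"], [1.0, 0.0, 0.0, 0.0]):
    print(v)
```

Edges would then connect a vertex ending at position $j$ to every vertex starting at position $j$, weighted by the negative log word-bigram probability, after which the same SSSP routine as before applies.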
The joint graph is rather huge and dense. According to our empirical statistics, with the Levenshtein threshold $\theta$ used in our experiments, the joint graph for a sentence of ordinary length already contains a very large number of vertices and edges.
To reduce the scale of the joint graph $G_j$, we first filter $G_c$ by searching its $k$-shortest paths to obtain a reduced graph $G_c'$, and construct $G_j$ on top of $G_c'$. Figure 5 shows the 3-shortest-paths filtered graph $G_c'$ and Figure 6 shows the corresponding $G_j$ for our running example. The scale of the graph is thus drastically reduced.
An efficient heap data structure is required in the $k$-shortest-paths algorithm [7] for backtracking the best paths to the current vertex while traversing. The heap is implemented as a priority queue of size $k$ sorted by path length, which should support efficient push and pop operations. A Fibonacci heap [9] is adopted for the heap implementation, since it has an amortized push complexity of $O(1)$, better than the $O(\log n)$ of other heap structures.
Another benefit of the $k$-shortest paths is that they can be used to generate $k$-best candidates for PTC conversion, which may be helpful for further performance improvement.
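A minimal sketch of the $k$-shortest-paths search is shown below. It uses Python's binary-heap `heapq` in place of the Fibonacci heap adopted in the paper, which is a simplification sufficient to illustrate the mechanics; the toy lattice is a hypothetical example.

```python
import heapq

def k_shortest_paths(graph, source, target, k):
    """graph: {node: [(neighbor, edge_weight), ...]} with non-negative weights."""
    heap = [(0.0, source, (source,))]       # (path cost, last node, path so far)
    pops = {}                                # how many paths have reached each node
    results = []
    while heap and len(results) < k:
        cost, node, path = heapq.heappop(heap)
        pops[node] = pops.get(node, 0) + 1
        if node == target:
            results.append((cost, list(path)))
            continue
        if pops[node] > k:                   # a node never needs more than k expansions
            continue
        for nxt, w in graph.get(node, []):
            heapq.heappush(heap, (cost + w, nxt, path + (nxt,)))
    return results

# Toy graph shaped like a small correction lattice.
graph = {
    "S": [("mi", 1.0), ("ni", 0.5)],
    "mi": [("hao", 0.3)], "ni": [("hao", 0.2)],
    "hao": [("E", 0.0)],
}
for cost, path in k_shortest_paths(graph, "S", "E", 3):
    print(round(cost, 2), path)
```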
The corpus used for evaluation is the one provided by [35], which was originally extracted from the People’s Daily corpus and labeled with pinyin. The corpus has already been split into training (Train), development (Dev) and test (Test) sets, as shown in Table 1.
|              | Train      | Dev    | Test      |
|--------------|------------|--------|-----------|
| # Sentences  | 1M         | 2K     | 100K      |
| # Characters | 43,679,593 | 83,765 | 4,123,184 |
SRILM [25] is adopted for language model training and KenLM [12, 11] for language model querying. The Chinese part of the corpus is segmented into words before LM training. Maximum matching word segmentation is used, with a large word vocabulary extracted from web data provided by [28]; the pinyin part is segmented according to the Chinese part. This vocabulary also serves as the PTC dictionary. Since the original vocabulary is not labeled with pinyin, we use the PTC dictionary of sunpinyin (http://code.google.com/p/sunpinyin/), an open-source Chinese pinyin IME, to label the vocabulary with pinyin. The emission probabilities are estimated as the “translation probability” from pinyin to Chinese using the lexical translation module of MOSES [17].
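For reference, a minimal sketch of forward maximum matching segmentation is given below; the vocabulary is a toy stand-in for the large web-derived word list of [28], and the real preprocessing pipeline may differ in details.

```python
# Greedy forward maximum matching: always take the longest vocabulary word,
# falling back to a single character when no word matches.
VOCAB = {"你好", "世界", "你", "好", "世", "界", "输入法"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)

def max_match(sentence):
    out, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + MAX_WORD_LEN), i, -1):
            if sentence[i:j] in VOCAB or j == i + 1:   # single char as fallback
                out.append(sentence[i:j])
                i = j
                break
    return out

print(max_match("你好世界"))   # -> ['你好', '世界']
```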
We use conventional sequence labeling evaluation metrics such as sequence accuracy and character accuracy. (We only work on the PTC conversion part of the IME, so we are unable to use existing evaluation systems [15] for full Chinese IME functions.)
Chinese characters in a sentence may be separated by digits, punctuation and alphabetic characters, which are input directly without the IME. We therefore follow the notion of Max Input Unit (MIU), the longest consecutive Chinese character sequence, proposed by [15]. We mainly consider MIU accuracy (MIU-Acc), the ratio of the number of completely correctly generated MIUs to the number of all MIUs, and character accuracy (Ch-Acc); sentence accuracy (S-Acc) is also reported in the evaluation results.
We also report the conversion error rate (ConvER) proposed by [44], which is the ratio of the number of mistyped pinyin words that are not converted to the right Chinese words to the total number of mistyped pinyin words. (Other evaluation metrics proposed by [44] are only suitable for their system, since our system uses a joint model.)
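The sketch below shows one plausible way to compute MIU-Acc and Ch-Acc; it is not the evaluation script of [15], and ConvER is omitted since it requires word-level alignment of the mistyped pinyin.

```python
import re

# MIUs are the longest runs of Chinese characters (CJK Unified Ideographs).
CHINESE = re.compile(r"[\u4e00-\u9fff]+")

def mius(text):
    """Max Input Units: longest consecutive Chinese character sequences."""
    return CHINESE.findall(text)

def miu_accuracy(references, outputs):
    total = correct = 0
    for ref, out in zip(references, outputs):
        ref_mius, out_mius = mius(ref), mius(out)
        total += len(ref_mius)
        correct += sum(r == o for r, o in zip(ref_mius, out_mius))
    return correct / total

def char_accuracy(references, outputs):
    # Naive position-wise comparison; a real script may align sequences first.
    total = correct = 0
    for ref, out in zip(references, outputs):
        total += len(ref)
        correct += sum(r == o for r, o in zip(ref, out))
    return correct / total

refs = ["你好，世界"]
outs = ["你好，视界"]
print(miu_accuracy(refs, outs), char_accuracy(refs, outs))   # 0.5 and 0.8
```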
First, we build a baseline system without typo correction, which is a pipeline of pinyin syllable segmentation and PTC conversion. The baseline system takes a pinyin input sequence, segments it into syllables, and then converts it into a Chinese character sequence.
Pinyin syllable segmentation already achieves very high accuracy (over 98%) with a trigram LM using improved Kneser-Ney smoothing. According to our empirical observation, the emission probabilities are mostly 1, since most Chinese words have a unique pronunciation, so in this step we simply fix the emission parameter $\lambda$. We consider different LM smoothing methods, including Kneser-Ney (KN), improved Kneser-Ney (IKN), and Witten-Bell (WB). All three smoothing methods are examined for bigram and trigram LMs, with both back-off and interpolated models. The number $N$ of best candidates for PTC conversion is set to 10. The results on Dev are shown in Figure 7, in which the “-i” suffix indicates an interpolated model. According to the results, we choose the trigram LM with interpolated Kneser-Ney smoothing.
The choice of the number $N$ of best candidates for PTC conversion also has a strong impact on the results. Figure 8 shows the results on Dev with different values of $N$, with the $x$-axis drawn in logarithmic scale. We observe that MIU-Acc slightly decreases as $N$ goes up, while Ch-Acc largely increases. We therefore choose a moderate $N$ as a trade-off.
The parameter $\lambda$ determines the emission probability. Results with different $\lambda$ on Dev are shown in Figure 9, with the $x$-axis drawn in logarithmic scale, and the best-performing $\lambda$ is chosen.
We compare our baseline system with several practical pinyin IMEs, including sunpinyin and Google Input Tools (online version, http://www.google.com/inputtools/try/). The results on Dev are shown in Table 2.
|           | MIU-Acc | Ch-Acc | S-Acc |
|-----------|---------|--------|-------|
| Baseline  | 73.39   | 96.24  | 38.00 |
| sunpinyin | 52.37   | 87.51  | 13.95 |
| Google    | 74.74   | 94.81  | 40.2  |
| Yang-ME   | -       | 93.3   | 30.2  |
| Yang-MT   | -       | 95.5   | 45.4  |
Based upon the baseline system, we build the joint system of PTC conversion with typo correction.
We simulate user typos by automatically generating random errors on the corpus. The typo rate is set according to previous Human-Computer Interaction (HCI) studies. Since little work has been done on modeling Chinese text entry, we refer to the corresponding results on English [32, 22, 6], which show that the average typo rate is about 2%. [44] performed an experiment in which 2,000 sentences of 11,968 Chinese words were entered by 5 native speakers. The collected data contains 775 mistyped pinyin words caused by one edit operation and 85 caused by two edit operations. As we observe on Train that the average pinyin word length is 5.24, the typo rate in the experiment of [44] can be roughly estimated as
$$\frac{775 \times 1 + 85 \times 2}{11{,}968 \times 5.24} \approx 1.5\%,$$
which is similar to the conclusion on English. Thus we generate corpora from Dev with typo rates of 0% (0-P), 2% (2-P), and 5% (5-P) to evaluate the system.
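One plausible realization of this typo simulation is sketched below; the exact generation procedure used in the paper is not fully specified, so the choice of edit operations and their uniform sampling here are assumptions.

```python
import random
import string

def corrupt(pinyin, rate=0.02, rng=random):
    """With probability `rate`, corrupt each letter by one random edit operation."""
    out = []
    for ch in pinyin:
        if rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(["sub", "del", "ins"])
        if op == "sub":                                  # replace with a random letter
            out.append(rng.choice(string.ascii_lowercase))
        elif op == "ins":                                # keep and insert an extra letter
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
        # "del": drop the letter entirely
    return "".join(out)

random.seed(0)
print(corrupt("nihaoshijie", rate=0.02))
```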
According to [44], most mistyped pinyin words are caused by one edit operation. Since a pinyin syllable is much shorter than a pinyin word, this ratio can be even higher for pinyin syllables. Based on our statistics on Train with 2% randomly generated typos, we set the Levenshtein threshold $\theta$ to 2.
We first fix the $k$-shortest-paths filter and tune the normalization parameter $\alpha$. Results with different $\alpha$ are shown in Figure 13.
With $\alpha$ fixed, we then select $k$. Results with different $k$ are shown in Figure 17. We choose a moderate $k$, since there is no significant improvement with larger values.
The selection of $k$ also directly bounds the running time of the joint model. With the chosen $k$, on a normal PC with an Intel Pentium Dual-Core E6700 CPU, the PTC conversion rate is over 2,000 characters per minute (cpm), which is much faster than the normal typing rate of 200 cpm.
With all parameters optimized, results on Test using the proposed joint model are shown in Table 3 and Table 4. Our results are compared with the baseline system without typo correction and with Google Input Tools. Since sunpinyin has no typo correction module and performs much worse than our baseline system, we do not include it in the comparison. Though no direct evidence indicates whether Google Input Tools has an independent typo correction component, its outputs suggest that such a component is unlikely to be available.
Since Google Input Tools has to be accessed through a web interface and the network connection cannot be guaranteed, we only take a subset of 10K sentences from Test for this comparison; the results are shown in Table 3.
The scores reported in [44] are not listed in Table 4 since the data set is different. They reported a ConvER of 53.56%, which is given here for reference.
Additionally, to further inspect the robustness of our model, the performance with typo rates ranging from 0% to 5% is shown in Figure 18. Although the performance decreases as the typo rate goes up, it remains quite satisfying around the typo rate of 2%, which is assumed to be the real-world situation.
|              | MIU-Acc | Ch-Acc | S-Acc | ConvER |
|--------------|---------|--------|-------|--------|
| Baseline 0-P | 79.90   | 97.47  | 48.87 | -      |
| Baseline 2-P | 50.47   | 90.53  | 11.12 | 99.95  |
| Baseline 5-P | 30.26   | 82.83  | 3.32  | 99.99  |
| Google 0-P   | 79.08   | 95.26  | 46.83 | -      |
| Google 2-P   | 49.47   | 61.50  | 11.08 | 91.70  |
| Google 5-P   | 29.18   | 36.20  | 3.29  | 94.64  |
| Joint 0-P    | 79.90   | 97.52  | 49.27 | -      |
| Joint 2-P    | 75.55   | 95.40  | 40.69 | 18.45  |
| Joint 5-P    | 67.76   | 90.17  | 27.86 | 24.68  |
|              | MIU-Acc | Ch-Acc | S-Acc | ConvER |
|--------------|---------|--------|-------|--------|
| Baseline 0-P | 74.46   | 96.42  | 40.50 | -      |
| Baseline 2-P | 47.25   | 89.50  | 9.62  | 99.95  |
| Baseline 5-P | 28.28   | 81.74  | 2.63  | 99.98  |
| Joint 0-P    | 74.22   | 96.39  | 40.34 | -      |
| Joint 2-P    | 69.91   | 94.14  | 33.11 | 21.35  |
| Joint 5-P    | 62.14   | 88.49  | 22.62 | 27.79  |
In this paper, we have developed a joint graph model for pinyin-to-Chinese conversion with typo correction. The model finds a joint global optimum of typo correction and PTC conversion over the entire input pinyin sequence. The evaluation results show that our model outperforms both previous academic systems and existing commercial products. In addition, the joint model is efficient enough for practical use.