Chinese language processing relies heavily on an efficient input method engine (IME), of which pinyin-to-Chinese (PTC) conversion is the core part. Meanwhile, although typos are inevitable during pinyin input, existing IMEs have paid little attention to this inconvenience. In this paper, motivated by a key equivalence between two decoding algorithms, we propose a joint graph model that globally optimizes PTC conversion and typo correction for IME. The evaluation results show that the proposed method outperforms both existing academic and commercial IMEs.
The daily life of Chinese people heavily depends on the Chinese input method engine (IME), no matter whether one is composing an e-mail, writing an article, or sending a text message. However, Chinese words cannot be typed into a computer or cellphone through a direct one-to-one key-to-letter mapping; they have to go through an IME, as there are thousands of Chinese characters to input while only 26 letter keys are available on the keyboard. An IME is an essential software interface that maps combinations of English letters into Chinese characters. An efficient IME largely improves the user experience of Chinese information processing.
Nowadays most Chinese IMEs are pinyin based. Pinyin was originally designed as the phonetic notation of Chinese characters (based on standard modern Chinese, Mandarin), using Latin letters for its syllables. For example, the pinyin of the Chinese character “爱” (love) is “ài”. Most characters have a unique pinyin representation, while a few Chinese characters may be pronounced in several different ways and thus have multiple pinyin representations. The advantage of a pinyin IME is that it relies only on the pronunciation of Chinese characters, so it is simple and easy to learn. But there are fewer than 500 pinyin syllables in standard modern Chinese, compared with over 6,000 commonly used Chinese characters, which leads to serious ambiguities in pinyin-to-character mapping. Modern pinyin IMEs mostly use a “sentence-based” decoding technique [4] to alleviate these ambiguities: the IME generates a sequence of Chinese characters from a sequence of pinyin inputs according to certain statistical criteria.
Since Chinese is written in characters rather than an alphabet, spell checking for Chinese is quite different from the same task for other languages. Because Chinese characters are entered via an IME, user-made typos do not immediately lead to spelling errors. When a user types a wrong letter, the IME will very likely fail to generate the expected Chinese character sequence. Normally, the user immediately notices the inputting error and then makes corrections, which usually means a bunch of extra operations such as cursor movement, deletion and re-typing. Thus there are two separate sub-tasks for Chinese spell checking: 1. typo checking for user-typed pinyin sequences, which should be a built-in module of the IME, and 2. spell checking for Chinese texts in its narrow sense, which is typically a module of word processing applications [36]. These two terms are often confused, especially in IME-related works such as [4] and [34].
Pinyin typos have always been a serious problem for Chinese pinyin IMEs. A user may fail to input the exactly right pinyin simply because he/she is a dialect speaker and does not know the exact pronunciation of the expected character. This is a rather common situation, since there are about seven quite different dialect groups in Chinese, six of which, as spoken languages, are far from the standard modern Chinese, Mandarin. With the boom of smartphones, pinyin typos worsen due to the limited size of soft keyboards and the lack of physical feedback on the touch screen. However, existing practical IMEs only provide small patches to deal with typos, such as Fuzzy Pinyin [33] and other language-specific errors [45].
Typo checking and correction has an important impact on IME performance. When an IME fails to correct a typo and generate the expected sentence, the user has to take extra effort to move the cursor back to the mistyped letter and correct it, which leads to a very poor user experience [15].
The very first approach to Chinese input with typo correction was made by [4], which was also the initial attempt at a “sentence-based” IME. They proposed the idea of a “statistical input method” by modeling PTC conversion as a hidden Markov model (HMM) and decoding the sequence with the Viterbi algorithm [26]. They solved the typo correction problem by decomposing the conditional probability of a Chinese character sequence given a pinyin sequence into a language model and a typing model. The typing model, estimated on real user input data, was responsible for typo correction. However, real user input data can be very noisy and is not convenient to obtain. As we propose a joint model in this paper, such an individual typing model is not necessary in our approach.
[44] developed an IME system with typo correction called CHIME, using a noisy channel error model and language-specific features. However, their model relies on a very strong assumption that the input pinyin sequence has already been segmented into pinyin words by the user. This assumption does not really hold in modern “sentence-based” IMEs. We relax this assumption since our model solves segmentation, typo correction and PTC conversion jointly.
Besides the common HMM approach to PTC conversion, various other methods have been explored, such as the support vector machine [16], maximum entropy (ME) model [30], conditional random field (CRF) [19] and statistical machine translation (SMT) [35, 29, 38].
Spell checking, or typo checking, was first proposed for English [24]. [23] argued that spell checking should be done within a context, i.e., a sentence or a long phrase with a certain meaning, instead of within one word only. A recent spell correction work is [20], which introduced a distributional similarity measure for spell correction of web queries.
Early attempts at Chinese spell checking date back to [2], where character tables for characters similar in shape, pronunciation, meaning, and input method code were proposed. More recently, the 7th SIGHAN Workshop on Chinese Language Processing [37] held a shared task on Chinese spell checking. Various approaches were proposed for the task, including language model (LM) based methods [3], the ME model [10], CRF [31, 27], SMT [5, 21], and graph models [14].
It is a rather long journey from the first letter typed on the keyboard to a complete Chinese sentence finally generated by the IME. We first take an overview of the entire process.
The average length of a pinyin syllable is about 3 letters, and about 410 pinyin syllables are used in the current pinyin system. Each pinyin syllable corresponds to a set of Chinese characters that share the pronunciation represented by the syllable; the number of such homophones ranges from 1 to over 300. Chinese characters then form words. But word is a rather vague concept in Chinese. Without word delimiters, linguists have long argued over what a Chinese word really is, which is why there is almost always a word segmentation step in Chinese language processing tasks [40, 13, 41, 39, 42, 43]. A Chinese word may contain from 1 to over 10 characters, depending on the word segmentation convention. Figure 1 demonstrates the relationship between pinyin and word, from the pinyin letters “nihao” to the word “你好 (hello)”. Typically, an IME takes the pinyin input, segments it into syllables, looks up corresponding words in a dictionary and generates a sentence from the candidate words.
Non-Chinese users may feel confused or even surprised to learn that when typing pinyin through an IME, Chinese users never enter delimiters such as the “Space” key to separate pinyin syllables or pinyin words; they just input the entire unsegmented pinyin sequence. For example, to input “你好世界 (Hello world)”, one simply types “nihaoshijie” instead of the segmented pinyin sequence “ni hao shi jie”. Nevertheless, pinyin syllable segmentation is a much easier problem than Chinese word segmentation. Since pinyin syllables have a very limited vocabulary and strictly follow a set of regularities, it is convenient to perform pinyin syllable segmentation with rules. But because the pinyin input is not segmented, it is nearly impossible to adopt previous spell checking methods for English for pinyin typo checking, even though techniques for English spell checking are well developed. Somewhat confusingly but interestingly, pinyin typo correction and segmentation are two sides of one problem: when a pinyin sequence is mistyped, it is unlikely to be correctly segmented; when it is segmented in an awkward way, it is likely to be mistyped.
Inspired by [36] and [14], we adopt a graph model, originally used for Chinese spell checking, for pinyin segmentation and typo correction; it is based on the shortest-path word segmentation algorithm [1]. The model has two major steps: segmentation and correction.
The shortest-path segmentation algorithm is based on the idea that a reasonable segmentation should minimize the number of segmented units. For a pinyin sequence $L = l_1 l_2 \dots l_n$, where $l_i$ is a letter, a directed acyclic graph (DAG) $G_s = (V_s, E_s)$ is first built for the pinyin segmentation step. The vertex set $V_s$ consists of the following parts:
Virtual start vertex $v_s$ and end vertex $v_e$;
Possible legal syllables fetched from the syllable dictionary $\mathcal{D}$ according to the input pinyin sequence: $\{\, l_i l_{i+1} \cdots l_j \mid l_i l_{i+1} \cdots l_j \in \mathcal{D} \,\}$;
The letter itself as a fallback, no matter whether it is a legal pinyin syllable or not: $\{\, l_i \mid 1 \le i \le n \,\}$.
The vertex weights are all set to 0. The edges in $E_s$ go from a syllable to every syllable that immediately follows it, i.e., $E_s = \{\, (l_i \cdots l_j) \to (l_{j+1} \cdots l_k) \,\}$.
The edge weight is the negative logarithm of the conditional probability that syllable $s$ is followed by $s'$, which is given by a bigram language model of pinyin syllables: $w(s \to s') = -\log P(s' \mid s)$.
The shortest path on the graph is the path with the least sum of weights: $\hat{p} = \arg\min_{p \in \mathrm{paths}(v_s, v_e)} \sum_{e \in p} w(e)$.
Computing the shortest path from $v_s$ to $v_e$ on $G_s$ yields the best segmentation. This is the single-source shortest path (SSSP) problem on a DAG, which has an efficient algorithm: preprocess the DAG with a topological sort, then traverse vertices and edges in topological order. Its time complexity is $O(|V_s| + |E_s|)$. For example, suppose one intends to input “你好世界 (Hello world)” by typing “nihaoshijie”, but mistypes it as “mihaoshijiw”. The graph for this input is shown in Figure 2. The shortest path, i.e., the best segmentation, is “mi hao shi ji w”. We will continue to use this example in the rest of this paper.
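The following Python sketch illustrates this segmentation step under simplifying assumptions: the syllable inventory is a toy subset, and the bigram model is replaced by a uniform placeholder so that the shortest path simply favors the fewest units; it is not the paper's implementation.

```python
import math

# Toy syllable inventory; a real system would load the full pinyin table
# (~410 syllables) and an estimated syllable bigram LM.
SYLLABLES = {"ni", "hao", "shi", "jie", "mi", "ji"}

def neg_log_bigram(prev, cur):
    """-log P(cur | prev); uniform placeholder so the path with the fewest units wins."""
    return 1.0

def segment(pinyin, max_len=6):
    n = len(pinyin)
    # Vertices: letter spans (i, j) that form a legal syllable, plus the
    # single-letter fallback (i, i+1). Sorting by end position gives a
    # topological order of the DAG.
    spans = [(i, i + 1) for i in range(n)]
    spans += [(i, j) for i in range(n)
              for j in range(i + 2, min(n, i + max_len) + 1)
              if pinyin[i:j] in SYLLABLES]
    spans.sort(key=lambda s: (s[1], s[0]))
    cost, back = {}, {}
    for (i, j) in spans:                       # traverse in topological order
        unit = pinyin[i:j]
        if i == 0:                             # edge from the virtual start vertex
            cost[(i, j)], back[(i, j)] = neg_log_bigram("<s>", unit), None
            continue
        # Relax edges from every predecessor span ending exactly at position i.
        best_c, best_p = math.inf, None
        for (p, q) in spans:
            if q == i:
                c = cost[(p, q)] + neg_log_bigram(pinyin[p:q], unit)
                if c < best_c:
                    best_c, best_p = c, (p, q)
        cost[(i, j)], back[(i, j)] = best_c, best_p
    # The virtual end vertex is reached through the cheapest span ending at position n.
    span = min((s for s in spans if s[1] == n), key=lambda s: cost[s])
    out = []
    while span is not None:
        out.append(pinyin[span[0]:span[1]])
        span = back[span]
    return list(reversed(out))

print(segment("mihaoshijiw"))   # -> ['mi', 'hao', 'shi', 'ji', 'w']
```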
Next, in the correction step, for the segmented pinyin sequence $S = s_1 s_2 \dots s_m$, a graph $G_c = (V_c, E_c)$ is constructed to perform typo correction. The vertex set $V_c$ consists of the following parts:
Virtual start vertex $v_s$ and end vertex $v_e$ with vertex weights of 0;
All legal syllables similar to the original syllable $s_i$ in $S$; if adjacent syllables can be merged into a legal syllable, the merged syllable is also added into $V_c$. Similarity is measured by the Levenshtein distance [18]: syllables whose Levenshtein distance to $s_i$ is under a certain threshold $\theta$ are considered similar, i.e., $\{\, s \in \mathcal{D} \mid d_{\mathrm{Lev}}(s, s_i) \le \theta \,\}$.
The vertex weight is the Levenshtein distance multiplied by a normalization parameter $\alpha$: $w(v) = \alpha \cdot d_{\mathrm{Lev}}(s, s_i)$.
Similar to $G_s$, the edges go from one syllable to all syllables next to it, and the edge weights are the negative log conditional probabilities between them. Computing the shortest path from $v_s$ to $v_e$ on $G_c$ yields the best typo correction result; note that this result is also already segmented. For our running example, the graph is shown in Figure 3, and the typo correction result is “mi hao shi jie”.
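A minimal sketch of the candidate-generation part of this step is given below; the syllable list, the normalization weight ALPHA and the distance threshold are toy assumptions, not the values used in our experiments.

```python
# Generate similar-syllable vertices for the correction graph: each legal
# syllable within a small Levenshtein distance of the typed syllable becomes a
# candidate vertex, weighted by ALPHA times the distance.
SYLLABLES = ["ni", "mi", "hao", "shi", "ji", "jie", "jin"]
ALPHA = 1.0          # normalization parameter for the Levenshtein vertex weight
THRESHOLD = 1        # maximum edit distance considered "similar" (toy setting)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similar_syllables(syllable):
    """All legal syllables within THRESHOLD edits, with their vertex weights."""
    cands = []
    for s in SYLLABLES:
        d = levenshtein(syllable, s)
        if d <= THRESHOLD:
            cands.append((s, ALPHA * d))
    return cands

for syl in ["mi", "hao", "shi", "ji", "w"]:
    print(syl, "->", similar_syllables(syl))
# "mi" yields both "mi" (distance 0) and "ni" (distance 1), so the graph can
# still recover the intended "ni" when the language model favors it.
```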
Using the above model alone, the typo correction result is not satisfying, no matter how much effort is spent. The major reason is that the basic semantic unit of Chinese is actually the word (though vaguely defined), which is usually composed of several characters; the conditional probability between single characters therefore does not make much sense. In addition, a pinyin syllable usually maps to dozens or even hundreds of homophonic characters, which makes the conditional probability between syllables even noisier. However, using pinyin words instead of syllables is not a wise choice either, because pinyin word segmentation is not as easy a task as syllable segmentation. To make typo correction better, we integrate it with PTC conversion in a joint model.
PTC conversion has long been viewed as a decoding problem on an HMM, and we follow this formalization. The best Chinese character sequence $\hat{W}$ for a given pinyin syllable sequence $S$ is the one with the highest conditional probability: $\hat{W} = \arg\max_{W} P(W \mid S)$.
In the HMM for a pinyin IME, the observation states are pinyin syllables, the hidden states are Chinese words, the emission probability is $P(s_i \mid w_i)$, and the transition probability is $P(w_i \mid w_{i-1})$. Note that the transition probability is the conditional probability between words instead of characters. PTC conversion decodes the Chinese word sequence from the pinyin sequence, and the Viterbi algorithm [26] is used for the decoding.
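As a concrete illustration of this HMM view, the toy Viterbi decoder below converts pre-segmented pinyin words into Chinese words; all dictionary entries and probabilities are made-up illustrative numbers, not estimates from our corpus.

```python
import math

# Toy emission probabilities P(pinyin | word) and word-bigram transitions;
# "<s>" is the start state. Numbers are illustrative only.
emission = {
    ("你好", "nihao"): 1.0,
    ("世界", "shijie"): 1.0,
    ("视界", "shijie"): 1.0,
}
transition = {
    ("<s>", "你好"): 0.5,
    ("你好", "世界"): 0.6,
    ("你好", "视界"): 0.05,
}
candidates = {"nihao": ["你好"], "shijie": ["世界", "视界"]}

def viterbi(pinyin_words):
    # paths maps a hidden word to (negative log cost, best word sequence ending there)
    paths = {"<s>": (0.0, [])}
    for py in pinyin_words:
        new_paths = {}
        for w in candidates[py]:
            best = None
            for prev, (cost, seq) in paths.items():
                p_t = transition.get((prev, w), 1e-6)   # unseen bigrams get a tiny floor
                p_e = emission.get((w, py), 1e-6)
                c = cost - math.log(p_t) - math.log(p_e)
                if best is None or c < best[0]:
                    best = (c, seq + [w])
            new_paths[w] = best
        paths = new_paths
    return min(paths.values())[1]

print(viterbi(["nihao", "shijie"]))   # -> ['你好', '世界']
```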
The shortest-path algorithm for typo correction and the Viterbi algorithm for PTC conversion are closely related. It has been strictly proven by [8] that the sequence decoding problem on an HMM is formally identical to finding a shortest path on a certain graph, which can be constructed in the following manner.
Consider a first-order HMM with observation set $O$, hidden state set $Q$, a special start state $q_0$, emission probabilities $P(o \mid q)$, transition probabilities $P(q' \mid q)$, and start probabilities $\pi(q) = P(q \mid q_0)$. For an observation sequence $o_1 o_2 \dots o_T$ of $T$ time periods, the decoding problem is to find the corresponding hidden state sequence with the highest probability, i.e.,
$$\hat{q}_{1:T} = \arg\max_{q_{1:T}} \; \pi(q_1)\, P(o_1 \mid q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})\, P(o_t \mid q_t). \qquad (1)$$
Then we construct a DAG $G_h = (V_h, E_h)$ upon the HMM. The vertex set $V_h$ includes:
Virtual start vertex $v_s$ and end vertex $v_e$ with vertex weights of 0;
Normal vertices $v_{t,q}$ for every time period $t$ and hidden state $q$, where $1 \le t \le T$ and $q \in Q$. The vertex weight is the negative logarithm of the emission probability: $w(v_{t,q}) = -\log P(o_t \mid q)$.
The edge set $E_h$ includes:
Edges from the start vertex to the first time period, $v_s \to v_{1,q}$, with edge weight $w(v_s \to v_{1,q}) = -\log \pi(q)$, where $q \in Q$;
Edges from the last time period to the end vertex, $v_{T,q} \to v_e$, with edge weights of 0;
Edges between adjacent time periods, $v_{t,q} \to v_{t+1,q'}$, with edge weight $w(v_{t,q} \to v_{t+1,q'}) = -\log P(q' \mid q)$, where $1 \le t < T$ and $q, q' \in Q$.
The shortest path from $v_s$ to $v_e$ is the one with the least sum of vertex and edge weights, i.e.,
$$\hat{p} = \arg\min_{p \in \mathrm{paths}(v_s, v_e)} \left( \sum_{v \in p} w(v) + \sum_{e \in p} w(e) \right). \qquad (2)$$
Since all weights are negative log probabilities, the sum in Eq. (2) is exactly the negative logarithm of the product in Eq. (1); minimizing one is therefore equivalent to maximizing the other, and the normal vertices on the shortest path give the best hidden state sequence.
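The short numeric check below makes this equivalence concrete on a toy two-state HMM with made-up probabilities: the DAG is built with negative log weights as in the construction above, and its shortest path is compared against a brute-force maximization of Eq. (1).

```python
import math
import itertools

# Toy first-order HMM; the numbers are illustrative only.
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {("A", "A"): 0.7, ("A", "B"): 0.3, ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.9, ("A", "y"): 0.1, ("B", "x"): 0.2, ("B", "y"): 0.8}
obs = ["x", "y", "y"]
T = len(obs)

# --- Shortest path on the DAG built as in the construction above ------------
# Vertex (t, s) carries -log P(obs[t] | s); edges carry -log transition
# (or -log start probability from the virtual start vertex).
cost, back = {}, {}
for s in states:                                    # edges from the start vertex
    cost[(0, s)] = -math.log(start[s]) - math.log(emit[(s, obs[0])])
    back[(0, s)] = None
for t in range(1, T):                               # vertices in topological order
    for s in states:
        best_prev = min(states,
                        key=lambda p: cost[(t - 1, p)] - math.log(trans[(p, s)]))
        cost[(t, s)] = (cost[(t - 1, best_prev)]
                        - math.log(trans[(best_prev, s)])
                        - math.log(emit[(s, obs[t])]))
        back[(t, s)] = best_prev
last = min(states, key=lambda s: cost[(T - 1, s)])  # edge to the end vertex has weight 0
path, t = [last], T - 1
while back[(t, path[-1])] is not None:
    path.append(back[(t, path[-1])])
    t -= 1
shortest_path_states = list(reversed(path))

# --- Brute-force maximization of Eq. (1) -------------------------------------
def prob(seq):
    p = start[seq[0]] * emit[(seq[0], obs[0])]
    for t in range(1, T):
        p *= trans[(seq[t - 1], seq[t])] * emit[(seq[t], obs[t])]
    return p

brute_force_states = max(itertools.product(states, repeat=T), key=prob)

print(shortest_path_states, list(brute_force_states))
assert shortest_path_states == list(brute_force_states)
```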
Given that the HMM decoding problem is identical to the SSSP problem on a DAG, we propose a joint graph model for PTC conversion with typo correction. The joint graph model aims to find the global optimum of both PTC conversion and typo correction over the entire input pinyin sequence. The joint graph $G_j = (V_j, E_j)$ is constructed on top of the typo correction graph $G_c$ of Section 3.2. The vertex set $V_j$ consists of the following parts:
Virtual start vertex $v_s$ and end vertex $v_e$ with vertex weights of 0;
Adjacent pinyin syllables in $G_c$ are merged into pinyin words, and the corresponding Chinese words are fetched from a PTC dictionary, i.e., a dictionary that maps pinyin words to Chinese words, and added as vertices;
The vertex weight consists of two parts: 1. the vertex weights of the merged syllables in $G_c$, and 2. the negative logarithm of the emission probability: $w(v) = \sum_{s \in v} w_{G_c}(s) - \log P(\mathrm{py}(v) \mid w_v)$, where $\mathrm{py}(v)$ denotes the merged pinyin word and $w_v$ the Chinese word of vertex $v$.
If the corresponding pinyin syllables in $G_c$ have an edge between them, the corresponding word vertices in $G_j$ also have an edge;
The edge weights are the negative logarithms of the transition probabilities between the Chinese words: $w(v \to v') = -\log P(w_{v'} \mid w_v)$.
Although the model is formulated on a first-order HMM, i.e., the LM used for the transition probability is a bigram model, it is easy to extend the model to take advantage of a higher-order $n$-gram LM by tracking a longer history while traversing the graph.
Computing the shortest path from $v_s$ to $v_e$ on $G_j$ yields the best pinyin-to-Chinese conversion with typo correction. For our running example, the graph is shown in Figure 4.
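The sketch below shows how the vertices of the joint graph can be generated from the corrected syllable lattice; the PTC dictionary entries and emission probabilities are toy, illustrative assumptions rather than values from our actual dictionary.

```python
import math

# Toy PTC dictionary: pinyin word -> [(Chinese word, emission probability)].
PTC_DICT = {
    "ni": [("你", 0.9), ("尼", 0.1)],
    "hao": [("好", 1.0)],
    "nihao": [("你好", 1.0)],
    "shijie": [("世界", 0.7), ("视界", 0.3)],
}

def joint_vertices(syllables, syl_weights, max_word_syls=3):
    """syllables: corrected syllable sequence; syl_weights: their G_c vertex weights.

    Returns joint-graph vertices as (start_idx, end_idx, chinese_word, weight),
    where the weight sums the covered syllable weights and -log emission."""
    vertices = []
    n = len(syllables)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_word_syls) + 1):
            pinyin_word = "".join(syllables[i:j])       # merge adjacent syllables
            for word, emit_p in PTC_DICT.get(pinyin_word, []):
                w = sum(syl_weights[i:j]) - math.log(emit_p)
                vertices.append((i, j, word, w))
    return vertices

# Corrected syllables of "mihaoshijiw" -> "ni hao shi jie"; the first syllable
# carries a Levenshtein vertex weight of 1 (mi -> ni).
for v in joint_vertices(["ni", "hao", "shi", "jie"], [1.0, 0.0, 0.0, 0.0]):
    print(v)
```

Edges would then connect a vertex ending at position $j$ to every vertex starting at position $j$, weighted by the negative log word-bigram probability, after which the same SSSP routine as before applies.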
The joint graph is rather huge and dense. According to our empirical statistics, with the Levenshtein threshold $\theta$ used in our experiments, the joint graph for a sentence of ordinary length already contains a very large number of vertices and edges.
To reduce the scale of the joint graph $G_j$, we first filter $G_c$ by searching its $k$-shortest paths to obtain a reduced graph $G_c'$, and construct $G_j$ on top of $G_c'$. Figure 5 shows the 3-shortest-paths filtered graph $G_c'$ and Figure 6 shows the corresponding $G_j$ for our running example. The scale of the graph is thus drastically reduced.
An efficient heap data structure is required in the $k$-shortest-paths algorithm [7] for backtracking the best paths to the current vertex while traversing. The heap is implemented as a priority queue of size $k$ sorted by path length, which should support efficient push and pop operations. A Fibonacci heap [9] is adopted for the heap implementation, since it has an amortized push complexity of $O(1)$, better than the $O(\log n)$ of other heap structures.
Another benefit of the $k$-shortest paths is that they can be used to generate $k$-best candidates for PTC conversion, which may be helpful for further performance improvement.
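A minimal sketch of the $k$-shortest-paths search is shown below. It uses Python's binary-heap `heapq` in place of the Fibonacci heap adopted in the paper, which is a simplification sufficient to illustrate the mechanics; the toy lattice is a hypothetical example.

```python
import heapq

def k_shortest_paths(graph, source, target, k):
    """graph: {node: [(neighbor, edge_weight), ...]} with non-negative weights."""
    heap = [(0.0, source, (source,))]       # (path cost, last node, path so far)
    pops = {}                                # how many paths have reached each node
    results = []
    while heap and len(results) < k:
        cost, node, path = heapq.heappop(heap)
        pops[node] = pops.get(node, 0) + 1
        if node == target:
            results.append((cost, list(path)))
            continue
        if pops[node] > k:                   # a node never needs more than k expansions
            continue
        for nxt, w in graph.get(node, []):
            heapq.heappush(heap, (cost + w, nxt, path + (nxt,)))
    return results

# Toy graph shaped like a small correction lattice.
graph = {
    "S": [("mi", 1.0), ("ni", 0.5)],
    "mi": [("hao", 0.3)], "ni": [("hao", 0.2)],
    "hao": [("E", 0.0)],
}
for cost, path in k_shortest_paths(graph, "S", "E", 3):
    print(round(cost, 2), path)
```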
The corpus used for evaluation is the one provided by [35], which was originally extracted from the People’s Daily corpus and labeled with pinyin. The corpus has already been split into training (Train), development (Dev) and test (Test) sets, as shown in Table 1.
|              | Train      | Dev    | Test      |
|--------------|------------|--------|-----------|
| # Sentences  | 1M         | 2K     | 100K      |
| # Characters | 43,679,593 | 83,765 | 4,123,184 |
SRILM [25] is adopted for language model training and KenLM [12, 11] for language model querying. The Chinese part of the corpus is segmented into words before LM training. Maximum matching word segmentation is used, with a large word vocabulary extracted from web data provided by [28]; the pinyin part is segmented according to the Chinese part. This vocabulary also serves as the PTC dictionary. Since the original vocabulary is not labeled with pinyin, we use the PTC dictionary of sunpinyin (http://code.google.com/p/sunpinyin/), an open-source Chinese pinyin IME, to label the vocabulary with pinyin. The emission probabilities are estimated as the “translation probability” from pinyin to Chinese using the lexical translation module of MOSES [17].
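For reference, a minimal sketch of forward maximum matching segmentation is given below; the vocabulary is a toy stand-in for the large web-derived word list of [28], and the real preprocessing pipeline may differ in details.

```python
# Greedy forward maximum matching: always take the longest vocabulary word,
# falling back to a single character when no word matches.
VOCAB = {"你好", "世界", "你", "好", "世", "界", "输入法"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)

def max_match(sentence):
    out, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + MAX_WORD_LEN), i, -1):
            if sentence[i:j] in VOCAB or j == i + 1:   # single char as fallback
                out.append(sentence[i:j])
                i = j
                break
    return out

print(max_match("你好世界"))   # -> ['你好', '世界']
```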
We use conventional sequence labeling evaluation metrics such as sequence accuracy and character accuracy. (We only work on the PTC conversion part of the IME, so we are unable to use existing evaluation systems [15] for full Chinese IME functions.)
Chinese characters in a sentence may be separated by digits, punctuation and alphabetic characters, which are input directly without the IME. We therefore follow the notion of Max Input Unit (MIU), the longest consecutive Chinese character sequence, proposed by [15]. We mainly consider MIU accuracy (MIU-Acc), the ratio of the number of completely correctly generated MIUs to the number of all MIUs, and character accuracy (Ch-Acc); sentence accuracy (S-Acc) is also reported in the evaluation results.
We also report the conversion error rate (ConvER) proposed by [44], which is the ratio of the number of mistyped pinyin words that are not converted to the right Chinese words to the total number of mistyped pinyin words. (Other evaluation metrics proposed by [44] are only suitable for their system, since our system uses a joint model.)
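The sketch below shows one plausible way to compute MIU-Acc and Ch-Acc; it is not the evaluation script of [15], and ConvER is omitted since it requires word-level alignment of the mistyped pinyin.

```python
import re

# MIUs are the longest runs of Chinese characters (CJK Unified Ideographs).
CHINESE = re.compile(r"[\u4e00-\u9fff]+")

def mius(text):
    """Max Input Units: longest consecutive Chinese character sequences."""
    return CHINESE.findall(text)

def miu_accuracy(references, outputs):
    total = correct = 0
    for ref, out in zip(references, outputs):
        ref_mius, out_mius = mius(ref), mius(out)
        total += len(ref_mius)
        correct += sum(r == o for r, o in zip(ref_mius, out_mius))
    return correct / total

def char_accuracy(references, outputs):
    # Naive position-wise comparison; a real script may align sequences first.
    total = correct = 0
    for ref, out in zip(references, outputs):
        total += len(ref)
        correct += sum(r == o for r, o in zip(ref, out))
    return correct / total

refs = ["你好，世界"]
outs = ["你好，视界"]
print(miu_accuracy(refs, outs), char_accuracy(refs, outs))   # 0.5 and 0.8
```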
First, we build a baseline system without typo correction, which is a pipeline of pinyin syllable segmentation and PTC conversion. The baseline system takes a pinyin input sequence, segments it into syllables, and then converts it into a Chinese character sequence.
Pinyin syllable segmentation already achieves very high accuracy (over 98%) with a trigram LM using improved Kneser-Ney smoothing. According to our empirical observation, the emission probabilities are mostly 1, since most Chinese words have a unique pronunciation, so in this step we simply fix the emission parameter $\lambda$. We consider different LM smoothing methods, including Kneser-Ney (KN), improved Kneser-Ney (IKN), and Witten-Bell (WB). All three smoothing methods are examined for bigram and trigram LMs, with both back-off and interpolated models. The number $N$ of best candidates for PTC conversion is set to 10. The results on Dev are shown in Figure 7, in which the “-i” suffix indicates an interpolated model. According to the results, we choose the trigram LM with interpolated Kneser-Ney smoothing.
The choice of the number $N$ of best candidates for PTC conversion also has a strong impact on the results. Figure 8 shows the results on Dev with different values of $N$, with the $x$-axis drawn in logarithmic scale. We observe that MIU-Acc slightly decreases as $N$ goes up, while Ch-Acc largely increases. We therefore choose a moderate $N$ as a trade-off.
The parameter $\lambda$ determines the emission probability. Results with different $\lambda$ on Dev are shown in Figure 9, with the $x$-axis drawn in logarithmic scale, and the best-performing $\lambda$ is chosen.
We compare our baseline system with several practical pinyin IMEs, including sunpinyin and Google Input Tools (online version, http://www.google.com/inputtools/try/). The results on Dev are shown in Table 2.
|           | MIU-Acc | Ch-Acc | S-Acc |
|-----------|---------|--------|-------|
| Baseline  | 73.39   | 96.24  | 38.00 |
| sunpinyin | 52.37   | 87.51  | 13.95 |
| Google    | 74.74   | 94.81  | 40.2  |
| Yang-ME   | -       | 93.3   | 30.2  |
| Yang-MT   | -       | 95.5   | 45.4  |
Based upon the baseline system, we build the joint system of PTC conversion with typo correction.
We simulate user typos by automatically generating random errors on the corpus. The typo rate is set according to previous Human-Computer Interaction (HCI) studies. Since little work has been done on modeling Chinese text entry, we refer to the corresponding results on English [32, 22, 6], which show that the average typo rate is about 2%. [44] performed an experiment in which 2,000 sentences of 11,968 Chinese words were entered by 5 native speakers. The collected data contains 775 mistyped pinyin words caused by one edit operation and 85 caused by two edit operations. As we observe on Train that the average pinyin word length is 5.24, the typo rate in the experiment of [44] can be roughly estimated as
$$\frac{775 \times 1 + 85 \times 2}{11{,}968 \times 5.24} \approx 1.5\%,$$
which is similar to the conclusion on English. Thus we generate corpora from Dev with typo rates of 0% (0-P), 2% (2-P), and 5% (5-P) to evaluate the system.
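One plausible realization of this typo simulation is sketched below; the exact generation procedure used in the paper is not fully specified, so the choice of edit operations and their uniform sampling here are assumptions.

```python
import random
import string

def corrupt(pinyin, rate=0.02, rng=random):
    """With probability `rate`, corrupt each letter by one random edit operation."""
    out = []
    for ch in pinyin:
        if rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(["sub", "del", "ins"])
        if op == "sub":                                  # replace with a random letter
            out.append(rng.choice(string.ascii_lowercase))
        elif op == "ins":                                # keep and insert an extra letter
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
        # "del": drop the letter entirely
    return "".join(out)

random.seed(0)
print(corrupt("nihaoshijie", rate=0.02))
```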
According to [44], most mistyped pinyin words are caused by one edit operation. Since a pinyin syllable is much shorter than a pinyin word, this ratio can be even higher for pinyin syllables. Based on our statistics on Train with 2% randomly generated typos, we set the Levenshtein threshold $\theta$ to 2.
We first fix the $k$-shortest-paths filter and tune the normalization parameter $\alpha$. Results with different $\alpha$ are shown in Figure 13.
With $\alpha$ fixed, we then select $k$. Results with different $k$ are shown in Figure 17. We choose a moderate $k$, since there is no significant improvement with larger values.
The selection of $k$ also directly bounds the running time of the joint model. With the chosen $k$, on a normal PC with an Intel Pentium Dual-Core E6700 CPU, the PTC conversion rate is over 2,000 characters per minute (cpm), which is much faster than the normal typing rate of 200 cpm.
With all parameters optimized, results on Test using the proposed joint model are shown in Table 3 and Table 4. Our results are compared with the baseline system without typo correction and with Google Input Tools. Since sunpinyin has no typo correction module and performs much worse than our baseline system, we do not include it in the comparison. Though no direct evidence indicates whether Google Input Tools has an independent typo correction component, its outputs suggest that such a component is unlikely to be available.
Since Google Input Tools has to be accessed through a web interface and the network connection cannot be guaranteed, we only take a subset of 10K sentences from Test for this comparison; the results are shown in Table 3.
The scores reported in [44] are not listed in Table 4 since the data set is different. They reported a ConvER of 53.56%, which is given here for reference.
Additionally, to further inspect the robustness of our model, the performance with typo rates ranging from 0% to 5% is shown in Figure 18. Although the performance decreases as the typo rate goes up, it remains quite satisfying around the typo rate of 2%, which is assumed to be the real-world situation.
|              | MIU-Acc | Ch-Acc | S-Acc | ConvER |
|--------------|---------|--------|-------|--------|
| Baseline 0-P | 79.90   | 97.47  | 48.87 | -      |
| Baseline 2-P | 50.47   | 90.53  | 11.12 | 99.95  |
| Baseline 5-P | 30.26   | 82.83  | 3.32  | 99.99  |
| Google 0-P   | 79.08   | 95.26  | 46.83 | -      |
| Google 2-P   | 49.47   | 61.50  | 11.08 | 91.70  |
| Google 5-P   | 29.18   | 36.20  | 3.29  | 94.64  |
| Joint 0-P    | 79.90   | 97.52  | 49.27 | -      |
| Joint 2-P    | 75.55   | 95.40  | 40.69 | 18.45  |
| Joint 5-P    | 67.76   | 90.17  | 27.86 | 24.68  |
|              | MIU-Acc | Ch-Acc | S-Acc | ConvER |
|--------------|---------|--------|-------|--------|
| Baseline 0-P | 74.46   | 96.42  | 40.50 | -      |
| Baseline 2-P | 47.25   | 89.50  | 9.62  | 99.95  |
| Baseline 5-P | 28.28   | 81.74  | 2.63  | 99.98  |
| Joint 0-P    | 74.22   | 96.39  | 40.34 | -      |
| Joint 2-P    | 69.91   | 94.14  | 33.11 | 21.35  |
| Joint 5-P    | 62.14   | 88.49  | 22.62 | 27.79  |
In this paper, we have developed a joint graph model for pinyin-to-Chinese conversion with typo correction. The model finds a joint global optimum of typo correction and PTC conversion over the entire input pinyin sequence. The evaluation results show that our model outperforms both previous academic systems and existing commercial products. In addition, the joint model is efficient enough for practical use.