Noisy channel models, widely used in modern spellers, cope with typical misspellings, but do not work well with infrequent and difficult spelling errors. In this paper, we have improved the noisy channel approach by iterative stochastic search for the best correction. The proposed algorithm allowed us to avoid local minima problem and improve the measure by 6.6% on distant spelling errors.
A speller is an essential part of any program associated with text input and processing — e-mail system, search engine, browser, form editor etc. To detect and correct spelling errors, the state of the art spelling correction systems use the noisy channel approach [6, 8, 1]. Its models are usually trained on large corpora and provide high effectiveness in correction of typical errors (most of which consist of 1-2 wrong characters per word), but does not work well for complex (multi-character) and infrequent errors.
In this paper, we improved effectiveness of the noisy channel for the correction of complex errors. In most cases, these are cognitive errors in loan words (folsvagen volkswagen), names of drugs (vobemzin wobenzym), names of brands (scatcher skechers), scientific terms (heksagidron hexahedron) and last names (Shwartzneger Schwarzenegger). In all these cases, the misspelled word contains many errors and the corresponding error model penalty cannot be compensated by the LM weight of its proper form. As a result, either the misspelled word itself, or the other (less complicated, more frequent) misspelling of the same word wins the likelihood race.
To compensate for this defect of the noisy channel, the iterative approach [3] is typically used. The search for the best variant is repeated several times, what allows correcting rather complex errors, but does not completely solve the problem of falling into local minima. To overcome this issue we suggest to consider more correction hypotheses. For this purpose we used a method based on the simulated annealing algorithm. We experimentally demonstrate that the proposed method outperforms the baseline noisy channel and iterative spellers.
Many authors employ machine learning to build rankers that compensate for the drawbacks of the noisy channel model: [9, 5]. These techniques can be combined with the proposed method by replacing posterior probability of single correction in our method with an estimate obtained via discriminative training method.
In our work, we focus on isolated word-error correction [7], which, in a sense, is a harder task, than multi-word correction, because there is no context available for misspelled words. For experiments we used single-word queries to a commercial search engine.
Noisy channel is a probabilistic model that defines posterior probability of being the intended word, given the observed word ; for such model, the optimal decision rule is the following:
(1) |
where is the source (language) model, and is the error model. Given defined, to correct the word we could iterate through all ever-observed words, and choose the one, that maximizes the posterior probability. However, the practical considerations demand that we do not rank the whole list of words, but instead choose between a limited number of hypotheses :
Given , generate a set of hypotheses , such that
(2) |
Choose the hypothesis that maximizes .
If hypotheses constitute a major part of the posterior probability mass, it is highly unlikely that the intended word is not among them.
In baseline speller we use a substring-based error model described in [1], the error model training method and the hypotheses generator are similar to [4].
For building language () and error () models, we use words collected from the 6-months query log of a commercial search engine.
Hypotheses generator is based on A* beam search in a trie of words, and yields hypotheses , for which the noisy channel scores are highest possible. Hypotheses generator has high K-best recall (see Section 4.2) — in 91.8% cases the correct hypothesis is found when , which confirms the assumption about covering almost all posterior probability mass (see Equation 2).
While choosing of the posterior probability is an optimal decision rule in theory, in practice it might not be optimal, due to limitations of the language and error modeling. For example, vobemzin is corrected to more frequent misspelling vobenzin (instead of correct form wobenzym) by the noisy channel, because is too low (see Table 1).
vobenzin | 2.289 | 31.75 | 34.04 |
wobenzym | 12.52 | 26.02 | 38.54 |
There have been attempts [3] to apply other rules, which would overcome limitations of language and error models with compensating changes described further.
Iterative spelling correction with iterations uses standard noisy-channel to correct the query repeatedly times. It is motivated by the assumption, that we are more likely to successfully correct the query if we take several short steps instead of one big step [3] .
Iterative correction is hill climbing in the space of possible corrections: on each iteration we make a transition to the best point in the neighbourhood, i.e. to correction, that has maximal posterior probability . As any local search method, iterative correction is prone to local minima, stopping before reaching the correct word.
A common method of avoiding local minima in optimization is the simulated annealing algorithm, key ideas from which can be adapted for spelling correction task. In this section we propose such an adaptation. Consider: we do not always transition deterministically to the next best correction, but instead transition randomly to a (potentially any) correction with transition probability being equal to the posterior , where is the correction we transition from, is the correction we transition to, and is defined by Equation 1. Iterative correction then turns into a random walk: we start at word and stop after random steps at some word , which becomes our answer.
To turn random walk into deterministic spelling correction algorithm, we de-randomize it, using the following transformation. Described random walk defines, for each word , a probability of ending up in after starting a walk from the initial query . With that probability defined, our correction algorithm is the following: given query , pick as a correction.
Probability of getting from to some is a sum, over all possible paths, of probabilities of getting from to via specific path :
(3) |
(4) |
where is the set of all possible words, and is the probability of observing as a query in the noisy-channel model.
Example: if we start a random walk from vobemzin and make 3 steps, we most probably will end up in the correct form wobenzym with . A few of the most probable random walk paths are shown in Table 2. Note, that despite the fact that most probable path does not lead to the correct word, many other paths to wobenzym sum up to 0.361, which is greater than probability of any other word. Also note, that the method works only because multiple misspellings of the same word are presented in our model; for related research see [2].
vobemzinvobenzinvobenzinvobenzin | 0.074 |
vobemzinvobenzimwobenzymwobenzym | 0.065 |
vobemzinvobenzinvobenzimvobenzim | 0.052 |
vobemzinvobenzimvobenzimwobenzym | 0.034 |
vobemzinwobenzymwobenzymwobenzym | 0.031 |
vobemzinwobenzimwobenzymwobenzym | 0.028 |
vobemzinwobenzynwobenzymwobenzym | 0.022 |
This section describes some common heuristic improvements, that, where possible, were applied both to the baseline methods and to the proposed algorithm.
Basic building block of every mentioned algorithm is one-step noisy-channel correction. Each basic correction proceeds as described in Section 2.1: a small number of hypotheses is generated for the query , hypotheses are scored, and scores are recomputed into normalized posterior probabilities (see Equation 5). Posterior probabilities are then either used to pick the best correction (in baseline and simple iterative correction), or are accumulated to later compute the score defined by Equation 3.
(5) |
A standard log-linear weighing trick was applied to noisy-channel model components, see e.g. [9]. is the parameter that controls the trade-off between precision and recall (see Section 4.2) by emphasizing the importance of either the high frequency of the correction or its proximity to the query.
We have also found, that resulting posterior probabilities emphasize the best hypothesis too much: best hypothesis gets almost all probability mass and other hypotheses get none. To compensate for that, posteriors were smoothed by raising each probability to some power and re-normalizing them afterward:
(6) |
In a sense, is like temperature parameter in simulated annealing – it controls the entropy of the walk and the final probability distribution. Unlike in simulated annealing, we fix for all iterations of the algorithm.
Finally, if posterior probability of the best hypothesis was lower than threshold , then the original query was used as the spell-checker output. (Posterior is defined by Equation 6 for the baseline and simple iterative methods and by Equations 3 and 6 for the proposed method). Parameter controls precision/recall trade-off (as well as mentioned above).
To evaluate the proposed algorithm we have collected two datasets. Both datasets were randomly sampled from single-word user queries from the 1-week query log of a commercial search engine. We annotated them with the help of professional analyst. The difference between datasets is that one of them contained only queries with low search performance: for which the number of documents retrieved by the search engine was less than a fixed threshold (we will address it as the ”hard” dataset), while the other dataset had no such restrictions (we will call it ”common”). Dataset statistics are shown in Table 3.
Dataset | Queries | Misspelled | Avg. |
---|---|---|---|
Common | 2240 | 224 (10%) | 5.98 |
Hard | 2542 | 1484 (58%) | 9.23 |
Increased average error model score and error rate of ”common” dataset compared to ”hard” shows, that we have indeed managed to collect hard-to-correct queries in the ”hard” dataset.
First of all, we evaluated the recall of hypotheses generator using K-best recall — the number of correct spelling corrections for misspelled queries among hypotheses divided by the total number of misspelled queries in the test set. Resulting recall with is 91.8% on ”hard” and 98.6% on ”common”.
Next, three spelling correction methods were tested: noisy channel, iterative correction and our method (stochastic iterative correction).
For evaluation of spelling correction quality, we use the following metrics:
Precision: The number of correct spelling corrections for misspelled words generated by the system divided by the total number of corrections generated by the system;
Recall: The number of correct spelling corrections for misspelled words generated by the system divided by the total number of misspelled words in the test set;
For hypotheses generator, was fixed: recall of 91.8% was considered big enough. Precision/recall tradeoff parameters and (they are applicable to each method, including baseline) were iterated by the grid , and (applicable to iterative and our method) and (just our method) were iterated by the grid ; for each set of parameters, precision and recall were measured on both datasets. Pareto frontiers for precision and recall are shown in Figures 1 and 2.
We were not able to reproduce superior performance of the iterative method over the noisy channel, reported by [3]. Supposedly, it is because the iterative method benefits primarily from the sequential application of split/join operations altering query decomposition into words; since we are considering only one-word queries, such decomposition does not matter.
On the ”hard” dataset the performance of the noisy channel and the iterative methods is inferior to our proposed method, see Figure 1. We tested all three methods on the ”common” dataset as well to evaluate if our handling of hard cases affects the performance of our approach on the common cases of spelling error. Our method performs well on the common cases as well, as Figure 2 shows. The performance comparison for the ”common” dataset shows comparable performance for all considered methods.
Noisy channel and iterative methods’ frontiers are considerably inferior to the proposed method on ”hard” dataset, which means that our method works better. The results on ”common” dataset show, that the proposed method doesn’t work worse than baseline.
Next, we optimized parameters for each method and each dataset separately to achieve the highest measure. Results are shown in Tables 4 and 5. We can see, that, given the proper tuning, our method can work better on any dataset (but it cannot achieve the best performance on both datasets at once). See Tables 4 and 5 for details.
Method | |||||
---|---|---|---|---|---|
Noisy channel | 0.6 | 0.1 | - | - | 55.8 |
Iterative | 0.6 | 0.1 | - | 2 | 55.9 |
Stochastic iterative | 0.9 | 0.2 | 0.35 | 3 | 62.5 |
Method | |||||
---|---|---|---|---|---|
Noisy channel | 0.75 | 0.225 | - | - | 62.06 |
Iterative | 0.8 | 0.275 | - | 2 | 63.15 |
Stochastic iterative | 1.2 | 0.4 | 0.35 | 3 | 63.9 |
Next, each parameter was separately iterated (by a coarser grid); initial parameters for each method were taken from Table 4. Such iteration serves two purposes: to show the influence of parameters on algorithm performance, and to show differences between datasets: in such setup parameters are virtually tuned using ”hard” dataset and evaluated using ”common” dataset. Results are shown in Table 6.
, common | , hard | ||||||
N.ch. | It. | Our | N.ch. | It. | Our | ||
= 0.5 | 45.3 | 45.9 | 37.5 | 54.9 | 54.8 | 50.0 | |
0.6 | 49.9 | 50.5 | 41.5 | 55.8 | 55.9 | 56.6 | |
0.7 | 50.4 | 50.4 | 44.1 | 54.5 | 55.1 | 59.6 | |
0.8 | 52.7 | 52.7 | 46.0 | 52.6 | 53.0 | 61.5 | |
0.9 | 53.5 | 53.5 | 49.3 | 50.3 | 50.6 | 62.5 | |
1.0 | 55.4 | 55.0 | 50.9 | 47.0 | 47.3 | 61.8 | |
1.1 | 53.7 | 53.4 | 52.7 | 44.3 | 44.6 | 60.8 | |
1.2 | 52.5 | 52.5 | 53.7 | 41.9 | 42.3 | 58.8 | |
1.3 | 52.2 | 52.6 | 54.6 | 39.5 | 39.9 | 56.6 | |
1.4 | 51.4 | 51.8 | 55.0 | 36.8 | 37.3 | 53.6 | |
= 0 | 41.0 | 41.5 | 33.0 | 52.9 | 53.1 | 58.3 | |
0.1 | 49.9 | 50.6 | 35.6 | 55.8 | 55.9 | 59.7 | |
0.15 | 59.4 | 59.8 | 43.2 | 55.8 | 55.6 | 61.6 | |
0.2 | 60.8 | 61.3 | 49.4 | 51.0 | 51.0 | 62.5 | |
0.25 | 54.0 | 54.0 | 54.9 | 46.3 | 46.3 | 61.1 | |
0.3 | 46.3 | 46.3 | 57.3 | 39.2 | 39.2 | 58.4 | |
0.4 | 25.8 | 25.8 | 53.9 | 22.3 | 22.3 | 50.3 | |
= 2 | 50.6 | 53.6 | 55.9 | 60.4 | |||
3 | 50.6 | 49.4 | 55.9 | 62.5 | |||
4 | 50.6 | 46.4 | 55.9 | 62.1 | |||
5 | 50.6 | 46.7 | 55.9 | 60.1 | |||
= 0.1 | 10.1 | 6.0 | |||||
0.2 | 49.4 | 51.5 | |||||
0.3 | 51.4 | 61.4 | |||||
0.35 | 49.4 | 62.5 | |||||
0.4 | 47.5 | 62.0 | |||||
0.45 | 45.8 | 60.8 | |||||
0.5 | 45.2 | 60.3 |
The proposed method is able to successfully correct distant spelling errors with edit distance of 3 characters (see Table 7).
Query | Noisy channel | Proposed method |
---|---|---|
akwamarin | akvamarin | aquamarine |
maccartni | maccartni | mccartney |
ariflaim | ariflaim | oriflame |
epika | epica | replica |
grid | grid | creed |
However, if our method is applied to shorter and more frequent queries (as opposed to ”hard” dataset), it tends to suggest frequent words as false-positive corrections (for example, grid is corrected to creed – Assassin’s Creed is popular video game). As can be seen in Table 5, in order to fix that, algorithm parameters need to be tuned more towards precision.
In this paper we introduced the stochastic iterative correction method for spell check corrections. Our experimental evaluation showed that the proposed method improved the performance of popular spelling correction approach – the noisy channel model – in the correction of difficult spelling errors. We showed how to eliminate the local minima issue of simulated annealing and proposed a technique to make our algorithm deterministic.
The experiments conducted on the specialized datasets have shown that our method significantly improves the performance of the correction of hard spelling errors (by 6.6% ) while maintaining good performance on common spelling errors.
In continuation of the work we are considering to expand the method to correct errors in multi-word queries, extend the method to work with discriminative models, and use a query performance prediction method, which tells for a query whether our algorithm needs to be applied.