We introduce a novel approach for building language models based on a systematic, recursive exploration of skip -gram models which are interpolated using modified Kneser-Ney smoothing. Our approach generalizes language models as it contains the classical interpolation with lower order models as a special case. In this paper we motivate, formalize and present our approach. In an extensive empirical experiment over English text corpora we demonstrate that our generalized language models lead to a substantial reduction of perplexity between and in comparison to traditional language models using modified Kneser-Ney smoothing. Furthermore, we investigate the behaviour over three other languages and a domain specific corpus where we observed consistent improvements. Finally, we also show that the strength of our approach lies in its ability to cope in particular with sparse training data. Using a very small training data set of only KB text we yield improvements of even reduction of perplexity.
Language Models are a probabilistic approach for predicting the occurrence of a sequence of words. They are used in many applications, e.g. word prediction [], speech recognition [], machine translation [], or spelling correction []. The task language models attempt to solve is the estimation of a probability of a given sequence of words . The probability of this sequence can be broken down into a product of conditional probabilities:
(1) |
Because of combinatorial explosion and data sparsity, it is very difficult to reliably estimate the probabilities that are conditioned on a longer subsequence. Therefore, by making a Markov assumption the true probability of a word sequence is only approximated by restricting conditional probabilities to depend only on a local context of preceding words rather than the full sequence . The challenge in the construction of language models is to provide reliable estimators for the conditional probabilities. While the estimators can be learnt—using, e.g., a maximum likelihood estimator over -grams obtained from training data—the obtained values are not very reliable for events which may have been observed only a few times or not at all in the training data.
Smoothing is a standard technique to overcome this data sparsity problem. Various smoothing approaches have been developed and applied in the context of language models. Chen and Goodman [] introduced modified Kneser-Ney Smoothing, which up to now has been considered the state-of-the-art method for language modelling over the last 15 years. Modified Kneser-Ney Smoothing is an interpolating method which combines the estimated conditional probabilities recursively with lower order models involving a shorter local context and their estimate for . The motivation for using lower order models is that shorter contexts may be observed more often and, thus, suffer less from data sparsity. However, a single rare word towards the end of the local context will always cause the context to be observed rarely in the training data and hence will lead to an unreliable estimation.
Because of Zipfian word distributions, most words occur very rarely and hence their true probability of occurrence may be estimated only very poorly. One word that appears at the end of a local context and for which only a poor approximation exists may adversely affect the conditional probabilities in language models of all lengths — leading to severe errors even for smoothed language models. Thus, the idea motivating our approach is to involve several lower order models which systematically leave out one position in the context (one may think of replacing the affected word in the context with a wildcard) instead of shortening the sequence only by one word at the beginning.
This concept of introducing gaps in -grams is referred to as skip -grams []. Among other techniques, skip -grams have also been considered as an approach to overcome problems of data sparsity []. However, to best of our knowledge, language models making use of skip -grams models have never been investigated to their full extent and over different levels of lower order models. Our approach differs as we consider all possible combinations of gaps in a local context and interpolate the higher order model with all possible lower order models derived from adding gaps in all different ways.
In this paper we make the following contributions:
We provide a framework for using modified Kneser-Ney smoothing in combination with a systematic exploration of lower order models based on skip -grams.
We show how our novel approach can indeed easily be interpreted as a generalized version of the current state-of-the-art language models.
We present a large scale empirical analysis of our generalized language models on eight data sets spanning four different languages, namely, a wikipedia-based text corpus and the JRC-Acquis corpus of legislative texts.
We empirically observe that introducing skip -gram models may reduce perplexity by compared to the current state-of-the-art using modified Kneser-Ney models on large data sets. Using small training data sets we observe even higher reductions of perplexity of up to .
The rest of the paper is organized as follows. We start with reviewing related work in Section 2. We will then introduce our generalized language models in Section 3. After explaining the evaluation methodology and introducing the data sets in Section 4 we will present the results of our evaluation in Section 5. In Section 6 we discuss why a generalized language model performs better than a standard language model. Finally, in Section 7 we summarize our findings and conclude with an overview of further interesting research challenges in the field of generalized language models.
Work related to our generalized language model approach can be divided in two categories: various smoothing techniques for language models and approaches making use of skip -grams.
Smoothing techniques for language models have a long history. Their aim is to overcome data sparsity and provide more reliable estimators—in particular for rare events. The Good Turing estimator [], deleted interpolation [], Katz backoff [] and Kneser-Ney smoothing [] are just some of the approaches to be mentioned. Common strategies of these approaches are to either backoff to lower order models when a higher order model lacks sufficient training data for good estimation, to interpolate between higher and lower order models or to interpolate with a prior distribution. Furthermore, the estimation of the amount of unseen events from rare events aims to find the right weights for interpolation as well as for discounting probability mass from unreliable estimators and to retain it for unseen events.
The state of the art is a modified version of Kneser-Ney smoothing introduced in []. The modified version implements a recursive interpolation with lower order models, making use of different discount values for more or less frequently observed events. This variation has been compared to other smoothing techniques on various corpora and has shown to outperform competing approaches. We will review modified Kneser-Ney smoothing in Section 2.1 in more detail as we reuse some ideas to define our generalized language model.
Smoothing techniques which do not rely on using lower order models involve clustering [], i.e. grouping together similar words to form classes of words, as well as skip -grams []. Yet other approaches make use of permutations of the word order in -grams [].
Skip -grams are typically used to incorporate long distance relations between words. Introducing the possibility of gaps between the words in an -gram allows for capturing word relations beyond the level of consecutive words without an exponential increase in the parameter space. However, with their restriction on a subsequence of words, skip -grams are also used as a technique to overcome data sparsity []. In related work different terminology and different definitions have been used to describe skip -grams. Variations modify the number of words which can be skipped between elements in an -gram as well as the manner in which the skipped words are determined (e.g. fixed patterns [] or functional words []).
The impact of various extensions and smoothing techniques for language models is investigated in []. In particular, the authors compared Kneser-Ney smoothing, Katz backoff smoothing, caching, clustering, inclusion of higher order -grams, sentence mixture and skip -grams. They also evaluated combinations of techniques, for instance, using skip -gram models in combination with Kneser-Ney smoothing. The experiments in this case followed two paths: (1) interpolating a -gram model with lower order distribution introducing a single gap and (2) interpolating higher order models with skip -grams which retained only combinations of two words. Goodman reported on small data sets and in the best case a moderate improvement of cross entropy in the range of to .
In [], the authors investigated the increase of observed word combinations when including skips in -grams. The conclusion was that using skip -grams is often more effective for increasing the number of observations than increasing the corpus size. This observation aligns well with our experiments.
We briefly recall modified Kneser-Ney Smoothing as presented in []. Modified Kneser-Ney implements smoothing by interpolating between higher and lower order -gram language models. The highest order distribution is interpolated with lower order distribution as follows:
(2) |
where provides the frequency count that sequence occurs in training data, is a discount value (which depends on the frequency of the sequence) and depends on and is the interpolation factor to mix in the lower order distribution11The factors and are quite technical and lengthy. As they do not play a significant role for understanding our novel approach we refer to Appendix A for details.. Essentially, interpolation with a lower order model corresponds to leaving out the first word in the considered sequence. The lower order models are computed differently using the notion of continuation counts rather than absolute counts:
(3) |
where the continuation counts are defined as , i.e. the number of different words which precede the sequence . The term is again an interpolation factor which depends on the discounted probability mass in the first term of the formula.
We express skip -grams using an operator notation. The operator applied to an -gram removes the word at the -th position. For instance: , where is used as wildcard placeholder to indicate a removed word. The wildcard operator allows for larger number of matches. For instance, when and then since at least the two sequences and match the sequence . In order to align with standard language models the skip operator applied to the first word of a sequence will remove the word instead of introducing a wildcard. In particular the equation holds where the right hand side is the subsequence of omitting the first word. We can thus formulate the interpolation step of modified Kneser-Ney smoothing using our notation as .
Thus, our skip -grams correspond to -grams of which we only use words, after having applied the skip operators
Interpolation with lower order models is motivated by the problem of data sparsity in higher order models. However, lower order models omit only the first word in the local context, which might not necessarily be the cause for the overall -gram to be rare. This is the motivation for our generalized language models to not only interpolate with one lower order model, where the first word in a sequence is omitted, but also with all other skip -gram models, where one word is left out. Combining this idea with modified Kneser-Ney smoothing leads to a formula similar to (2.1).
(4) |
The difference between formula (2.1) and formula (3.2) is the way in which lower order models are interpolated.
Note, the sum over all possible positions in the context for which we can skip a word and the according lower order models . We give all lower order models the same weight .
The same principle is recursively applied in the lower order models in which some words of the full -gram are already skipped. As in modified Kneser-Ney smoothing we use continuation counts for the lower order models, incorporating the skip operator also for these counts. Incorporating this directly into modified Kneser-Ney smoothing leads in the second highest model to:
(5) | ||||
Given that we skip words at different positions, we have to extend the notion of the count function and the continuation counts. The count function applied to a skip -gram is given by , i.e. we aggregate the count information over all words which fill the gap in the -gram. Regarding the continuation counts we define:
(6) | |||||
(7) |
As lowest order model we use—just as done for traditional modified Kneser-Ney []—a unigram model interpolated with a uniform distribution for unseen words.
The overall process is depicted in Figure 1, illustrating how the higher level models are recursively smoothed with several lower order ones.
To evaluate the quality of our generalized language models we empirically compare their ability to explain sequences of words. To this end we use text corpora, split them into test and training data, build language models as well as generalized language models over the training data and apply them on the test data. We employ established metrics, such as cross entropy and perplexity. In the following we explain the details of our experimental setup.
For evaluation purposes we employed eight different data sets. The data sets cover different domains and languages. As languages we considered English (en), German (de), French (fr), and Italian (it). As general domain data set we used the full collection of articles from Wikipedia (wiki) in the corresponding languages. The download dates of the dumps are displayed in Table 1.
de | en | fr | it |
Nov 22nd | Nov 04th | Nov 20th | Nov 25th |
Special purpose domain data are provided by the multi-lingual JRC-Acquis corpus of legislative texts (JRC) []. Table 2 gives an overview of the data sets and provides some simple statistics of the covered languages and the size of the collections.
Statistics | ||
\arraybackslashCorpus | total words | unique words |
\arraybackslash | in Mio. | in Mio. |
\arraybackslashwiki-de | 579 | 9.82 |
\arraybackslashJRC-de | 30.9 | 0.66 |
\arraybackslashwiki-en | 1689 | 11.7 |
\arraybackslashJRC-en | 39.2 | 0.46 |
\arraybackslashwiki-fr | 339 | 4.06 |
\arraybackslashJRC-fr | 35.8 | 0.46 |
\arraybackslashwiki-it | 193 | 3.09 |
\arraybackslashJRC-it | 34.4 | 0.47 |
The data sets come in the form of structured text corpora which we cleaned from markup and tokenized to generate word sequences. We filtered the word tokens by removing all character sequences which did not contain any letter, digit or common punctuation marks. Eventually, the word token sequences were split into word sequences of length which provided the basis for the training and test sets for all algorithms. Note that we did not perform case-folding nor did we apply stemming algorithms to normalize the word forms. Also, we did our evaluation using case sensitive training and test data. Additionally, we kept all tokens for named entities such as names of persons or places.
All data sets have been randomly split into a training and a test set on a sentence level. The training sets consist of 80% of the sentences, which have been used to derive -grams, skip -grams and corresponding continuation counts for values of between 1 and 5. Note that we have trained a prediction model for each data set individually. From the remaining 20% of the sequences we have randomly sampled a separate set of sequences of words each. These test sequences have also been shortened to sequences of length , and and provide a basis to conduct our final experiments to evaluate the performance of the different algorithms.
We learnt the generalized language models on the same split of the training corpus as the standard language model using modified Kneser-Ney smoothing and we also used the same set of test sequences for a direct comparison. To ensure rigour and openness of research the data set for training as well as the test sequences and the entire source code is open source. 22http://west.uni-koblenz.de/Research 33https://github.com/renepickhardt/generalized-language-modeling-toolkit 44http://glm.rene-pickhardt.de We compared the probabilities of our language model implementation (which is a subset of the generalized language model) using KN as well as MKN smoothing with the Kyoto Language Model Toolkit 55http://www.phontron.com/kylm/. Since we got the same results for small and small data sets we believe that our implementation is correct.
In a second experiment we have investigated the impact of the size of the training data set. The wikipedia corpus consists of bn. words. Thus, the split for training consists of bn. words. We have iteratively created smaller training sets by decreasing the split factor by an order of magnitude. So we created / and / split, and so on. We have stopped at the split as the training data set in this case consisted of less words than our 100k test sequences which we still randomly sampled from the test data of each split. Then we trained a generalized language model as well as a standard language model with modified Kneser-Ney smoothing on each of these samples of the training data. Again we have evaluated these language models on the same random sample of sequences as mentioned above.
As evaluation metric we use perplexity: a standard measure in the field of language models []. First we calculate the cross entropy of a trained language model given a test set using
(8) |
Where will be replaced by the probability estimates provided by our generalized language models and the estimates of a language model using modified Kneser-Ney smoothing. , instead, is a maximum likelihood estimator of the test sequence to occur in the test corpus. Finally, is the set of test sequences. The perplexity is defined as:
(9) |
Lower perplexity values indicate better results.
As a baseline for our generalized language model (GLM) we have trained standard language models using modified Kneser-Ney Smoothing (MKN). These models have been trained for model lengths to . For unigram and bigram models MKN and GLM are identical.
The perplexity values for all data sets and various model orders can be seen in Table 3. In this table we also present the relative reduction of perplexity in comparison to the baseline.
model length | |||
---|---|---|---|
\arraybackslashExperiments | |||
\arraybackslashwiki-de MKN | 1074.1 | 778.5 | 597.1 |
\arraybackslashwiki-de GLM | 1031.1 | 709.4 | 521.5 |
\arraybackslashrel. change | 4.0% | 8.9% | 12.7% |
\arraybackslashJRC-de MKN | 235.4 | 138.4 | 94.7 |
\arraybackslashJRC-de GLM | 229.4 | 131.8 | 86.0 |
\arraybackslashrel. change | 2.5% | 4.8% | 9.2% |
\arraybackslashwiki-en MKN | 586.9 | 404 | 307.3 |
\arraybackslashwiki-en GLM | 571.6 | 378.1 | 275 |
\arraybackslashrel. change | 2.6% | 6.1% | 10.5% |
\arraybackslashJRC-en MKN | 147.2 | 82.9 | 54.6 |
\arraybackslashJRC-en GLM | 145.3 | 80.6 | 52.5 |
\arraybackslashrel. change | 1.3% | 2.8% | 3.9% |
\arraybackslashwiki-fr MKN | 538.6 | 385.9 | 298.9 |
\arraybackslashwiki-fr GLM | 526.7 | 363.8 | 272.9 |
\arraybackslashrel. change | 2.2% | 5.7% | 8.7% |
\arraybackslashJRC-fr MKN | 155.2 | 92.5 | 63.9 |
\arraybackslashJRC-fr GLM | 153.5 | 90.1 | 61.7 |
\arraybackslashrel. change | 1.1% | 2.5% | 3.5% |
\arraybackslashwiki-it MKN | 738.4 | 532.9 | 416.7 |
\arraybackslashwiki-it GLM | 718.2 | 500.7 | 382.2 |
\arraybackslashrel. change | 2.7% | 6.0% | 8.3% |
\arraybackslashJRC-it MKN | 177.5 | 104.4 | 71.8 |
\arraybackslashJRC-it GLM | 175.1 | 101.8 | 69.6 |
\arraybackslashrel. change | 1.3% | 2.6% | 3.1% |
As we can see, the GLM clearly outperforms the baseline for all model lengths and data sets. In general we see a larger improvement in performance for models of higher orders (). The gain for 3-gram models, instead, is negligible. For German texts the increase in performance is the highest () for a model of order . We also note that GLMs seem to work better on broad domain text rather than special purpose text as the reduction on the wiki corpora is constantly higher than the reduction of perplexity on the JRC corpora.
We made consistent observations in our second experiment where we iteratively shrank the size of the training data set. We calculated the relative reduction in perplexity from MKN to GLM for various model lengths and the different sizes of the training data. The results for the English Wikipedia data set are illustrated in Figure 2.
We see that the GLM performs particularly well on small training data. As the size of the training data set becomes smaller (even smaller than the evaluation data), the GLM achieves a reduction of perplexity of up to compared to language models with modified Kneser-Ney smoothing on the same data set. The absolute perplexity values for this experiment are presented in Table 4.
model length | |||
\arraybackslashExperiments | |||
\arraybackslash MKN | 586.9 | 404 | 307.3 |
\arraybackslash GLM | 571.6 | 378.1 | 275 |
\arraybackslashrel. change | 2.6% | 6.5% | 10.5% |
\arraybackslash MKN | 712.6 | 539.8 | 436.5 |
\arraybackslash GLM | 683.7 | 492.8 | 382.5 |
\arraybackslashrel. change | 4.1% | 8.7% | 12.4% |
\arraybackslash MKN | 894.0 | 730.0 | 614.1 |
\arraybackslash GLM | 838.7 | 650.1 | 528.7 |
\arraybackslashrel. change | 6.2% | 10.9% | 13.9% |
\arraybackslash MKN | 1099.5 | 963.8 | 845.2 |
\arraybackslash GLM | 996.6 | 820.7 | 693.4 |
\arraybackslashrel. change | 9.4% | 14.9% | 18.0% |
\arraybackslash MKN | 1212.1 | 1120.5 | 1009.6 |
\arraybackslash GLM | 1025.6 | 875.5 | 750.3 |
\arraybackslashrel. change | 15.4% | 21.9% | 25.7% |
Our theory as well as the results so far suggest that the GLM performs particularly well on sparse training data. This conjecture has been investigated in a last experiment. For each model length we have split the test data of the largest English Wikipedia corpus into two disjoint evaluation data sets. The data set unseen consists of all test sequences which have never been observed in the training data. The set observed consists only of test sequences which have been observed at least once in the training data. Again we have calculated the perplexity of each set. For reference, also the values of the complete test data set are shown in Table 5.
model length | |||
\arraybackslashExperiments | |||
\arraybackslashMKNcomplete | 586.9 | 404 | 307.3 |
\arraybackslashGLMcomplete | 571.6 | 378.1 | 275 |
\arraybackslashrel. change | 2.6% | 6.5% | 10.5% |
\arraybackslashMKNunseen | 14696.8 | 2199.8 | 846.1 |
\arraybackslashGLMunseen | 13058.7 | 1902.4 | 714.4 |
\arraybackslashrel. change | 11.2% | 13.5% | 15.6% |
\arraybackslashMKNobserved | 220.2 | 88.0 | 43.4 |
\arraybackslashGLMobserved | 220.6 | 88.3 | 43.5 |
\arraybackslashrel. change |
As expected we see the overall perplexity values rise for the unseen test case and decline for the observed test case. More interestingly we see that the relative reduction of perplexity of the GLM over MKN increases from to on the unseen test case. This indicates that the superior performance of the GLM on small training corpora and for higher order models indeed comes from its good performance properties with regard to sparse training data. It also confirms that our motivation to produce lower order -grams by omitting not only the first word of the local context but systematically all words has been fruitful. However, we also see that for the observed sequences the GLM performs slightly worse than MKN. For the observed cases we find the relative change to be negligible.
In our experiments we have observed an improvement of our generalized language models over classical language models using Kneser-Ney smoothing. The improvements have been observed for different languages, different domains as well as different sizes of the training data. In the experiments we have also seen that the GLM performs well in particular for small training data sets and sparse data, encouraging our initial motivation. This feature of the GLM is of particular value, as data sparsity becomes a more and more immanent problem for higher values of . This known fact is underlined also by the statistics shown in Table 6. The fraction of total -grams which appear only once in our Wikipedia corpus increases for higher values of . However, for the same value of the skip -grams are less rare. Our generalized language models leverage this additional information to obtain more reliable estimates for the probability of word sequences.
\arraybackslash | total | unique |
---|---|---|
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash | ||
\arraybackslash |
Beyond the general improvements there is an additional path for benefitting from generalized language models. As it is possible to better leverage the information in smaller and sparse data sets, we can build smaller models of competitive performance. For instance, when looking at Table 4 we observe the -gram MKN approach on the full training data set to achieve a perplexity of . This model has been trained on GB of text and the resulting model has a size of GB and Mio. entries for the count and continuation count values. Looking for a GLM with comparable but better performance we see that the -gram model trained on of the training data has a perplexity of . This GLM model has a size of GB and contains only Mio. entries. So, using a far smaller set of training data we can build a smaller model which still demonstrates a competitive performance.
We have introduced a novel generalized language model as the systematic combination of skip -grams and modified Kneser-Ney smoothing. The main strength of our approach is the combination of a simple and elegant idea with an an empirically convincing result. Mathematically one can see that the GLM includes the standard language model with modified Kneser-Ney smoothing as a sub model and is consequently a real generalization.
In an empirical evaluation, we have demonstrated that for higher orders the GLM outperforms MKN for all test cases. The relative improvement in perplexity is up to for large data sets. GLMs also performs particularly well on small and sparse sets of training data. On a very small training data set we observed a reduction of perplexity by . Our experiments underline that the generalized language models overcome in particular the weaknesses of modified Kneser-Ney smoothing on sparse training data.
A desirable extension of our current definition of GLMs will be the combination of different lower lower order models in our generalized language model using different weights for each model. Such weights can be used to model the statistical reliability of the different lower order models. The value of the weights would have to be chosen according to the probability or counts of the respective skip -grams.
Another important step that has not been considered yet is compressing and indexing of generalized language models to improve the performance of the computation and be able to store them in main memory. Regarding the scalability of the approach to very large data sets we intend to apply the Map Reduce techniques from [] to our generalized language models in order to have a more scalable calculation.
This will open the path also to another interesting experiment. Goodman [] observed that increasing the length of -grams in combination with modified Kneser-Ney smoothing did not lead to improvements for values of beyond 7. We believe that our generalized language models could still benefit from such an increase. They suffer less from the sparsity of long -grams and can overcome this sparsity when interpolating with the lower order skip -grams while benefiting from the larger context.
Finally, it would be interesting to see how applications of language models—like next word prediction, machine translation, speech recognition, text classification, spelling correction, e.g.—benefit from the better performance of generalized language models.
We would like to thank Heinrich Hartmann for a fruitful discussion regarding notation of the skip operator for -grams. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013), REVEAL (Grant agree number 610928).
The discount value used in formula (2.1) is defined as []:
(10) |
The discounting values , , and are defined as []
(11a) | |||
(11b) | |||
(11c) |
with and is the total number of -grams which appear exactly times in the training data. The weight is defined as:
(12) | ||||
And the weight is defined as:
(13) | ||||
where , , and are analogously defined to .