The relative frequencies of character bigrams appear to carry substantial information for predicting the first language (L1) of the writer of a text in another language (L2). Tsur and Rappoport (2007) interpret this fact as evidence that word choice is dictated by the phonology of L1. In order to test their hypothesis, we design an algorithm to identify the most discriminative words and the corresponding character bigrams, and we perform two experiments to quantify their impact on the L1 identification task. The results strongly suggest an alternative explanation for the effectiveness of character bigrams in identifying the native language of a writer.
The task of Native Language Identification (NLI) is to determine the first language of the writer of a text in another language. In a ground-breaking paper, Koppel et al. (2005) propose a set of features for this task: function words, character n-grams, rare part-of-speech bigrams, and various types of errors. They report 80% accuracy in classifying a set of English texts into five L1 languages using a multi-class linear SVM.
The First Shared Task on Native Language Identification [24] attracted submissions from 29 teams. The accuracy on a set of English texts representing eleven L1 languages ranged from 31% to 83%. Many types of features were employed, including word length, sentence length, paragraph length, document length, sentence complexity, punctuation and capitalization, cognates, dependency parses, topic models, word suffixes, collocations, function word n-grams, skip-grams, word networks, Tree Substitution Grammars, string kernels, cohesion, and passive constructions [1, 17, 3, 5, 7, 9, 11, 12, 4, 16, 18, 19, 20, 21, 22, 23, 26]. Word n-gram features appear to be particularly effective, as they were used by the most competitive teams, including the one that achieved the highest overall accuracy [13]. Furthermore, the most discriminative word n-grams often contained the name of the native language, or of countries where it is commonly spoken [8, 19, 21]. We refer to such words as toponymic terms.
There is no doubt that the toponymic terms are useful for increasing the NLI accuracy; however, from the psycho-linguistic perspective, we are more interested in what characteristics of L1 show up in L2 texts. Clearly, L1 affects the L2 writing in general, and the choice of words in particular, but what is the role played by the phonology? Tsur and Rappoport (2007) observe that limiting the set of features to the relative frequency of the 200 most frequent character bigrams yields a respectable 66% accuracy on a 5-language classification task. The authors propose the following hypothesis to explain this finding: “the choice of words [emphasis added] people make when writing in a second language is strongly influenced by the phonology of their native language”. As the orthography of alphabetic languages is at least partially representative of the underlying phonology, character bigrams may capture these phonological preferences.
In this paper, we provide evidence against the above hypothesis. We design an algorithm to identify the most discriminative words and the character bigrams that are indicative of such words, and perform two experiments to quantify their impact on the NLI task. The results of the first experiment demonstrate that the removal of a relatively small set of discriminative words from the training data significantly impairs the accuracy of a bigram-based classifier. The results of the second experiment reveal that the most indicative bigrams are quite similar across different language sets. We conclude that character bigrams are effective in determining L1 of the author because they reflect differences in L2 word usage that are unrelated to the phonology of L1.
Tsur and Rappoport (2007) report that character bigrams are more effective for the NLI task than either unigrams or trigrams. We are interested in identifying the character bigrams that are indicative of the most discriminative words in order to quantify their impact on the bigram-based classifier.
We follow both Koppel et al. (2005) and Tsur and Rappoport (2007) in using a multi-class SVM classifier for the NLI task. The classifier computes a weight for each feature coupled with each L1 language by attempting to maximize the overall accuracy on the training set. For example, if we train the classifier using words as features, with values representing their frequency relative to the length of the document, the features corresponding to the word China might receive the following weights:
Arabic | Chinese | Hindi | Japanese | Telugu |
---|---|---|---|---|
-770 | 1720 | -276 | -254 | -180 |
These weights indicate that the word provides strong positive evidence for Chinese as L1, as opposed to the other four languages.
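As a concrete illustration, such per-language weights can be obtained with any linear multi-class SVM. The following sketch uses scikit-learn's LinearSVC (a one-vs-rest linear SVM, rather than the SVM-Multiclass implementation used in our experiments); docs and labels are hypothetical placeholders for the training texts and their L1 labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# docs: list of L2 essays; labels: each writer's L1 (placeholder data)
vec = CountVectorizer()
X = vec.fit_transform(docs)                  # raw word counts
X = X.multiply(1.0 / X.sum(axis=1)).tocsr()  # frequency relative to document length
clf = LinearSVC().fit(X, labels)

# clf.coef_ holds one weight vector per L1 language; the column for a
# given word yields weights analogous to the China example above.
china_weights = clf.coef_[:, vec.vocabulary_['china']]
```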
We propose to quantify the importance of each word by converting its SVM feature weights into a single score using the following formula:

$$\mathit{WordScore}_i = \sqrt{\sum_{l=1}^{n} w_{i,l}^{2}}$$

where $n$ is the number of languages, and $w_{i,l}$ is the feature weight of word $i$ in language $l$. The formula assigns higher scores to words with weights of high magnitude, either positive or negative. We use the Euclidean norm rather than the sum of raw weights because we are interested in the discriminative power of the words.
We normalize the word scores by dividing them by the score of the 200th-ranked word. Consequently, only the top 200 words have normalized scores greater than or equal to 1. In our previous example, the 200th word has a word score of 1493, while China has a word score of 1930, which is normalized to 1.29. On the other hand, the 1000th word gets a normalized score of 0.43.
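The scoring and normalization steps can be sketched as follows (a minimal illustration with hypothetical names; weight_map is assumed to map each word to its vector of per-language SVM weights):

```python
import math

def word_score(weights):
    """Euclidean norm of a word's per-language feature weights."""
    return math.sqrt(sum(w * w for w in weights))

# The weights of the word "China" from the example above:
print(round(word_score([-770, 1720, -276, -254, -180])))  # 1930

def normalized_scores(weight_map, rank=200):
    """Divide every score by that of the 200th-ranked word, so that
    exactly the top `rank` words receive scores >= 1."""
    scores = {w: word_score(v) for w, v in weight_map.items()}
    norm = sorted(scores.values(), reverse=True)[rank - 1]
    return {w: s / norm for w, s in scores.items()}
```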
Algorithm 2: Identifying the discriminative words and indicative character bigrams.
1:  create a list of words in the training data
2:  train an SVM using words as features
3:  for all words i do
4:      WordScore_i ← sqrt( Σ_l w_{i,l}² )
5:  end for
6:  sort words by WordScore
7:  NormValue ← WordScore of the 200th word
8:  create a list of the 200 most frequent character bigrams
9:  for bigrams k = 1 to 200 do
10:     BigramScore_k ← Π_{i : bigram k occurs in word i} (WordScore_i / NormValue)
11: end for
12: sort character bigrams by BigramScore
In order to identify the bigrams that are indicative of the most discriminative words, we promote those that appear in high-scoring words, and downgrade those that appear in low-scoring words. Some bigrams that appear often in high-scoring words may nevertheless be very common. For example, the bigram an occurs in words like Japan, German, and Italian, but also by itself as a determiner, as an adjectival suffix, and as part of the conjunction and. Therefore, we calculate the importance score of each character bigram by multiplying the normalized scores of all the words in which the bigram occurs; since words outside the top 200 have normalized scores below 1, each occurrence in an undistinctive word drives the bigram's score down.
Algorithm 2 summarizes our method of identifying the discriminative words and indicative character bigrams. In line 2, we train an SVM on the words encountered in the training data. In lines 3 and 4, we assign the Euclidean norm of the weight vector of each word as its score. Starting in line 7, we determine which character bigrams are representative of high scoring words. In line 10, we calculate the bigram scores.
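The bigram-scoring step (lines 7 to 11 of Algorithm 2) might look as follows; this is a simplified sketch rather than the authors' implementation, and norm_scores is assumed to be the output of the normalization sketched above:

```python
def bigram_scores(norm_scores, top_bigrams):
    """Score each bigram as the product of the normalized scores of all
    words containing it; words outside the top 200 contribute factors
    below 1, dragging down bigrams that mostly occur in common words."""
    scores = {}
    for bg in top_bigrams:
        product = 1.0
        for word, s in norm_scores.items():
            if bg in word:
                product *= s
        scores[bg] = product
    return sorted(scores, key=scores.get, reverse=True)
```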
In this section, we describe two experiments aimed at quantifying the importance of the discriminative words and the indicative character bigrams that are identified by Algorithm 2.
We use two different NLI corpora. We follow the setup of Tsur and Rappoport (2007) by extracting two sets, denoted I1 and I2 (Table 1), from the International Corpus of Learner English (ICLE), Version 2 [10]. Each set consists of 238 documents per language, randomly selected from the ICLE corpus. Each document corresponds to a different author, and contains between 500 and 1000 words. We follow their methodology in performing 10-fold cross-validation on the same sets of languages.
For the development of the method described in Section 2, we used a different corpus, namely the TOEFL Non-Native English Corpus [2]. It consists of essays written by native speakers of eleven languages, divided into three English proficiency levels. In order to maintain consistency with the ICLE sets, we extracted three sets of five languages apiece (Table 1), with each set including both related and unrelated languages: European languages that use the Latin script (T1), non-European languages that use non-Latin scripts (T2), and a mixture of both types (T3). Each sub-corpus was divided into a training set of 80%, and development and test sets of 10% each. The training sets comprise approximately 700 documents per language, with an average length of 350 words per document. There are over 5000 word types per language, and over 1000 distinct character bigrams in total. The test sets include approximately 90 documents per language. We report results on the test sets, after training on both the training and development sets.
Table 1: The sets of five L1 languages used in our experiments.

Corpus | Set | Languages
---|---|---
ICLE | I1 | Bulgarian, Czech, French, Russian, Spanish
ICLE | I2 | Czech, Dutch, Italian, Russian, Spanish
TOEFL | T1 | French, German, Italian, Spanish, Turkish
TOEFL | T2 | Arabic, Chinese, Hindi, Japanese, Telugu
TOEFL | T3 | French, German, Japanese, Korean, Telugu
We replicate the experiments of Tsur and Rappoport (2007) by limiting the features to the 200 most frequent character bigrams. (Our development experiments suggest that using the full set of bigrams yields higher accuracy; however, we restrict the features to the 200 most frequent bigrams for consistency with previous work.) The feature values are set to the frequency of the character bigrams normalized by the length of the document. We use these feature vectors as input to the SVM-Multiclass classifier [14]. The results are shown in the Baseline column of Table 2.
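A sketch of the baseline feature extraction, under the assumption that document length is measured in character-bigram tokens (hypothetical helper name):

```python
from collections import Counter

def bigram_feature_vector(text, top_bigrams):
    """Relative frequencies of the 200 most frequent character bigrams,
    normalized by the length of the document."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    length = max(1, len(text) - 1)
    return [counts[bg] / length for bg in top_bigrams]
```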
Table 2: The accuracy (%) of the baseline bigram-based classifier, and the decrease in accuracy after each of the four modifications (negative values denote an increase).

Set | Baseline | Random Words | Discriminative Words | Random Bigrams | Indicative Bigrams
---|---|---|---|---|---
I1 | 67.5 | 0.2 | 3.6 | 1.0 | 2.2
I2 | 66.9 | 2.5 | 5.5 | 0.7 | 2.8
T1 | 60.7 | 3.3 | 7.7 | 2.5 | 3.9
T2 | 60.6 | -0.5 | 3.8 | 1.1 | 5.9
T3 | 62.2 | -0.3 | 0.0 | 0.5 | 4.1
The objective of the first experiment is to quantify the influence of the most discriminative words on the accuracy of the bigram-based classifier. Using Algorithm 2, we identify the 100 most discriminative words and remove them from the training data. The bigram counts are then recalculated, and the new 200 most frequent bigrams are used as features for the character-level SVM. Note that the number of features in the classifier remains unchanged.
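A sketch of this manipulation (hypothetical helpers; tokenization details are simplified):

```python
from collections import Counter

def remove_words_and_recount(docs, discriminative, n_bigrams=200):
    """Delete all tokens of the discriminative words, then recompute
    the most frequent character bigrams on the filtered corpus."""
    banned = set(discriminative)
    filtered = [' '.join(t for t in doc.split() if t not in banned)
                for doc in docs]
    counts = Counter(bg for doc in filtered
                     for bg in (doc[i:i + 2] for i in range(len(doc) - 1)))
    top = [bg for bg, _ in counts.most_common(n_bigrams)]
    return filtered, top
```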
The results are shown in the Discriminative Words column of Table 2. We see a statistically significant drop in the accuracy of the classifier with respect to the baseline in all sets except T3. The words that are identified as the most discriminative include function words, punctuation, very common content words, and the toponymic terms. The 10 highest scoring words from T1 are: indeed, often, statement, : (colon), question, instance, … (ellipsis), opinion, conclude, and however. In addition, France, Turkey, Italian, Germany, and Italy are all found among the top 70 words.
For comparison, we quantify the effect of removing the same number of randomly selected words from the training data. Specifically, we discard all tokens of 100 word types whose frequency is the same as, or slightly higher than, that of the discriminative words. The results are shown in the Random Words column of Table 2. The decrease is much smaller for I1, I2, and T1, while the accuracy actually increases for T2 and T3. This illustrates the impact that the most discriminative words have on the bigram-based classifier, beyond a simple reduction in the amount of training data.
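One way to implement the frequency-matched control, assuming a corpus frequency table; the matching scheme below (the nearest unused word type of equal or slightly higher frequency) is our reading of the description, not necessarily the original procedure:

```python
def matched_random_words(freq, discriminative):
    """For each discriminative word, pick an unused word type with the
    same or slightly higher corpus frequency."""
    by_freq = sorted(freq, key=freq.get)  # ascending frequency
    taken = set(discriminative)
    picks = []
    for w in discriminative:
        for cand in by_freq:
            if cand not in taken and freq[cand] >= freq[w]:
                picks.append(cand)
                taken.add(cand)
                break
    return picks
```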
Using Algorithm 2, we identify the top 20 character bigrams, and replace them as features with bigrams randomly selected from outside the set of the 200 most frequent. The results of this experiment are reported in the Indicative Bigrams column of Table 2. The replacement of any 20 of the top bigrams with 20 less useful bigrams can be expected to cause some drop in accuracy, regardless of which bigrams are chosen. For comparison, the Random Bigrams column of Table 2 shows the mean decrease in accuracy over 100 trials in which 20 bigrams randomly selected from the set of 200 are replaced with random bigrams from outside of the set.
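A sketch of the feature-set substitution in the second experiment (hypothetical names; all_bigrams stands for the full bigram inventory of the training data):

```python
import random

def swap_indicative(top200, indicative, all_bigrams, k=20, seed=0):
    """Replace the k most indicative bigrams with bigrams drawn at
    random from outside the 200 most frequent."""
    rng = random.Random(seed)
    outside = [b for b in all_bigrams if b not in set(top200)]
    kept = [b for b in top200 if b not in set(indicative[:k])]
    return kept + rng.sample(outside, k)
```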
The results indicate that our algorithm indeed identifies 20 bigrams that are on average more important than the other 180. What is striking is that the sets of 20 indicative character bigrams overlap substantially across the different language sets. Table 3 shows the 17 bigrams that are common to all three TOEFL sets, ordered by their score, together with some of the high-scoring words in which they occur. Four of the bigrams consist of punctuation marks and a space. (It appears that only the relatively low frequency of most punctuation bigrams prevents them from dominating the set of indicative bigrams; when using all bigrams instead of the top 200, the majority of the indicative bigrams contain punctuation.) The remaining bigrams indicate function words, toponymic terms like Germany, and frequent content words like take and new.
The situation is similar in the ICLE sets, where likewise 17 out of 20 bigrams are shared. The inter-fold overlap is even greater, with 19 out of 20 bigrams appearing in each of the 10 folds. In particular, the bigrams fr and bu can be traced both to the function words from and but, and to the presence of French and Bulgarian in I1. However, the fact that the two bigrams are also on the list for the I2 set, which includes neither of these languages, suggests that their importance is mostly due to the function words.
Table 3: The 17 indicative character bigrams shared by the three TOEFL sets, ordered by score, with examples of high-scoring words in which they occur (an underscore denotes a space).

Bigram | Words
---|---
_, | |
,_ | |
_. | |
._ | |
u_ | you Telugu |
f_ | of |
ny | any many Germany |
yo | you your |
w_ | now how |
i_ | I |
_y | you your |
ew | new knew |
kn | know knew |
ey | they Turkey |
wh | what why where etc. |
of | of |
ak | make take |
In the first experiment, we showed that the removal of the 100 most discriminative words from the training data results in a significant drop in the accuracy of a classifier based exclusively on character bigrams. If the hypothesis of Tsur and Rappoport (2007) were true, this should not be the case, as the phonology of L1 would influence the choice of words across the entire lexicon, rather than being concentrated in a small set of words.
In the second experiment, we found that the majority of the most indicative character bigrams are shared among different language sets. The bigrams appear to reflect primarily high-frequency function words. If the hypothesis were true, this should not be the case, as the diverse L1 phonologies would induce different sets of bigrams. In fact, the highest-scoring bigrams reflect punctuation patterns, which have little to do with word choice.
We have provided experimental evidence against the hypothesis that the phonology of L1 strongly affects the choice of words in L2. We showed that a small set of high-frequency function words has a disproportionate influence on the accuracy of a bigram-based NLI classifier, and that the majority of the indicative bigrams appear to be independent of L1. This suggests an alternative explanation for the effectiveness of a bigram-based classifier in identifying the native language of a writer: the character bigrams simply mirror differences in word usage rather than the phonology of L1.
Our explanation concurs with the findings of Daland (2013) that unigram frequency differences in certain types of phonological segments between child-directed and adult-directed speech are due to a small number of word types, such as you, what, and want, rather than to any general phonological preferences. He argues that the relative frequency of sounds in speech is driven by the relative frequency of words. In a similar vein, Koppel et al. (2005) view the usefulness of character n-grams as "simply an artifact of variable usage of particular words, which in turn might be the result of different thematic preferences," or as a reflection of the L1 orthography.
We conclude by noting that our experimental results do not imply that the phonology of L1 has absolutely no influence on L2 writing. Rather, they show that the evidence from the Native Language Identification task has so far been inconclusive in this regard.
We thank the participants and the organizers of the shared task on NLI at the BEA8 workshop for sharing their reflections on the task. We also thank an anonymous reviewer for pointing out the study of Daland (2013).
This research was supported by the Natural Sciences and Engineering Research Council of Canada and the Alberta Innovates Technology Futures.