Words undergo various changes when entering new languages. Based on the assumption that these linguistic changes follow certain rules, we propose a method for automatically detecting pairs of cognates which employs an orthographic alignment technique that has proved relevant for sequence alignment in computational biology. We use aligned subsequences as features for machine learning algorithms in order to infer rules for the linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Given a list of known cognates, our approach does not require any other linguistic information. However, it can be customized to integrate historical information regarding language evolution.
Cognates are words in different languages having the same etymology and a common ancestor. Investigating pairs of cognates is very useful in historical and comparative linguistics, in the study of language relatedness [30], in phylogenetic inference [1] and in identifying how and to what extent languages change over time. In several other research areas, such as language acquisition, bilingual word recognition [9], corpus linguistics [32], cross-lingual information retrieval [4] and machine translation [19], the condition of common etymology is usually not essential, and cognates are regarded as words with similar meaning and high orthographic or phonetic similarity across languages.
The wide range of applications in which cognates prove useful has drawn increasing attention to methods for detecting such related pairs of words. The task is most challenging for resource-poor languages, for which etymological information is not accessible. Therefore, research in this area [17, 25, 16] has focused on the automatic identification of cognate pairs, starting from lists of known cognates.
In this paper, we propose a method for automatically determining pairs of cognates across languages. The proposed method requires a list of known cognates and, for languages for which additional linguistic information is available, it can be customized to integrate historical information regarding the evolution of the language. The rest of the paper is organized as follows: in Section 2 we present and analyze alternative methods and related work in this area. In Section 3 we introduce our approach for detection of cognates using orthographic alignment. In Section 4 we describe the experiments we conduct and we report and analyze the results, together with a comparison with previous methods. Finally, in Section 5 we draw the conclusions of our study and describe our plans for extending the method.
There are three important aspects widely investigated in the task of cognate identification: semantic, phonetic and orthographic similarity. They were employed both individually [32, 17, 6] and combined [21, 34] in order to detect pairs of cognates across languages. For determining semantic similarity, external lexical resources, such as WordNet [10], or large corpora, might be necessary. For measuring phonetic and orthographic proximity of cognate candidates, string similarity metrics can be applied, using the phonetic or orthographic word forms as input. Various measures were investigated and compared [17, 14]; Levenshtein distance [22], XDice [3] and the longest common subsequence ratio [24] are among the most frequently used metrics in this field. Gomes and Lopes (2011) proposed SpSim, a more complex method for computing the similarity of cognate pairs which tolerates learned transitions between words.
Algorithms for string alignment have been successfully used for identifying cognates based on both their orthographic and phonetic forms. Delmestri and Cristianini (2010) used basic sequence alignment algorithms [29, 33, 12] to obtain orthographic alignment scores for cognate candidates. Kondrak (2000) developed the ALINE system, which aligns words' phonetic transcriptions based on multiple phonetic features and computes similarity scores using dynamic programming. List (2012) proposed LexStat, a framework for the automatic detection of cognate pairs which combines different approaches to sequence comparison and alignment derived from those used in historical linguistics and evolutionary biology.
The changes undergone by words when passing from one language into another, and the transformation rules they follow, have been successfully employed in various approaches to cognate detection [18, 25, 28]. These orthographic changes have also been used in cognate production, which is closely related to the task of cognate detection, but has not yet been as intensively studied. While the purpose of cognate detection is to determine whether two given words form a cognate pair, the aim of cognate production is, given a word in a source language, to automatically produce its cognate pair in a target language. Beinborn et al. (2013) proposed a method for cognate production relying on statistical character-based machine translation, learning orthographic production patterns, and Mulloni (2007) introduced an algorithm for cognate production based on edit distance alignment and the identification of orthographic cues when words enter a new language.
Although there are multiple aspects that are relevant in the study of language relatedness, such as orthographic, phonetic, syntactic and semantic differences, in this paper we focus only on lexical evidence. The orthographic approach relies on the idea that sound changes leave traces in the orthography and alphabetic character correspondences represent, to a fairly large extent, sound correspondences [8].
Words undergo various changes when entering new languages. We assume that rules for adapting foreign words to the orthographic system of the target languages might not have been very well defined in their period of early development, but they may have since become complex and probably language-specific. Detecting pairs of cognates based on etymology is useful and reliable, but, for resource-poor languages, methods which require less linguistic knowledge might be necessary. According to Gusfield (1997), an edit transcript (representing the conversion of one string to another) and an alignment are mathematically equivalent ways of describing relationships between strings. Therefore, because the edit distance has been widely used in this research area and has produced good results, we are encouraged to employ orthographic alignment for identifying pairs of cognates, not only to compute similarity scores, as has previously been done, but also to use aligned subsequences as features for machine learning algorithms. Our intuition is that inferring language-specific rules for aligning words will lead to better performance in the task of cognate identification.
String alignment is closely related to the task of sequence alignment in computational biology. Therefore, to align pairs of words we employ the Needleman-Wunsch global alignment algorithm [29], which is mainly used for aligning sequences of proteins or nucleotides. Global sequence alignment aims at determining the best alignment over the entire length of the input sequences. The algorithm uses dynamic programming and is thus guaranteed to find the optimal alignment. Its key property is that any partial subpath along the optimal path must itself be the optimal path leading up to that point; therefore, the optimal path can be determined by incremental extension of the optimal subpaths [31]. For orthographic alignment, we consider words as input sequences and use a very simple substitution matrix, which gives equal scores to all substitutions and disregards diacritics (e.g., we ensure that e and è are matched).
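As an illustration, the alignment step can be sketched as follows. This is a minimal sketch in which the match, mismatch and gap scores are illustrative assumptions (the method only requires that all substitutions score equally and that diacritics be disregarded), and ties in the traceback are broken so as to reproduce the example alignment below:

```python
import unicodedata

def base(c):
    # Compare characters by their base form so that, e.g., e and è count
    # as a match; the original characters are kept in the output.
    return ''.join(ch for ch in unicodedata.normalize('NFD', c)
                   if unicodedata.category(ch) != 'Mn')

def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment with a uniform substitution matrix.
    The match/mismatch/gap scores are illustrative assumptions."""
    n, m = len(s), len(t)
    sub = lambda a, b: match if base(a) == base(b) else mismatch
    # Dynamic-programming score matrix; row/column 0 model leading gaps.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i-1][j-1] + sub(s[i-1], t[j-1]),
                              score[i-1][j] + gap,
                              score[i][j-1] + gap)
    # Traceback recovers one optimal alignment; on ties we prefer a gap
    # in the target word (a deletion from source to target).
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and score[i][j] == score[i-1][j] + gap:
            a.append(s[i-1]); b.append('-'); i -= 1          # deletion
        elif i > 0 and j > 0 and \
                score[i][j] == score[i-1][j-1] + sub(s[i-1], t[j-1]):
            a.append(s[i-1]); b.append(t[j-1]); i -= 1; j -= 1
        else:
            a.append('-'); b.append(t[j-1]); j -= 1          # insertion
    return ''.join(reversed(a)), ''.join(reversed(b))

print(needleman_wunsch('exhaustiv', 'esaustivo'))
# ('exhaustiv-', 'es-austivo')
```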
Using aligned pairs of words as input, we extract features around mismatches in the alignments. There are three types of mismatches, corresponding to the following operations: insertion, deletion and substitution. For example, for the Romanian word exhaustiv and its Italian cognate pair esaustivo, the alignment is as follows:

e x h a u s t i v -
e s - a u s t i v o
The first mismatch (between x and s) is caused by a substitution, the second mismatch (between h and -) is caused by a deletion from source language to target language, and the third mismatch (between - and o) is caused by an insertion from source language to target language. The features we use are character n-grams around mismatches. We experiment with two types of features:
- n-grams around gaps, i.e., we account only for insertions and deletions;
- n-grams around any type of mismatch, i.e., we account for all three types of mismatches.
The second alternative leads to better performance, so we account for all mismatches. As for the length of the n-grams, we experiment with several values of n (the best-performing value for each pair of languages is reported in Table 3). We achieve slight improvements by combining n-grams, i.e., for a given n, we use all i-grams, where i ≤ n. In order to provide information regarding the position of the features, we mark the beginning and the end of the word with a $ symbol. Thus, for the above-mentioned pair of cognates, (exhaustiv, esaustivo), we extract the following features when n = 2:

x>s, ex>es, xh>s-, h>-, xh>s-, ha>-a, ->o, v->vo, -$>o$

We count identical features only once; because one feature (xh>s-) occurs twice in our example, this yields 8 distinct features for the pair (exhaustiv, esaustivo).
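A minimal sketch of this feature extraction is given below; the enumeration of all i-gram windows (i ≤ n) covering each mismatch position is our reading of the procedure, chosen so that it reproduces the 8 features above:

```python
def extract_features(aligned_src, aligned_tgt, n=2):
    """Character i-grams (i <= n) around mismatches in an alignment."""
    # Mark word boundaries so features carry positional information.
    src, tgt = '$' + aligned_src + '$', '$' + aligned_tgt + '$'
    features = set()
    for p, (a, b) in enumerate(zip(src, tgt)):
        if a == b:
            continue  # mismatches only: substitutions, insertions, deletions
        for i in range(1, n + 1):
            # All i-gram windows that cover the mismatch position p.
            for start in range(p - i + 1, p + 1):
                if start >= 0 and start + i <= len(src):
                    features.add(src[start:start+i] + '>' + tgt[start:start+i])
    return features

feats = extract_features('exhaustiv-', 'es-austivo', n=2)
print(sorted(feats))
# 8 unique features: 'x>s', 'ex>es', 'xh>s-', 'h>-', 'ha>-a',
# '->o', 'v->vo' and '-$>o$'
```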
We use Naive Bayes as a baseline and we experiment with Support Vector Machines (SVMs) to learn orthographic changes and to discriminate between pairs of cognates and non-cognates. We put our system together using the Weka workbench [15], a suite of machine learning algorithms and tools. For the SVM, we use the wrapper provided by Weka for LibSVM [5]. We use the radial basis function (RBF) kernel, which can handle the case when the relation between class labels and attributes is non-linear, as it maps samples non-linearly into a higher-dimensional space. Given two instances $x_i$ and $x_j$, where $x_i, x_j \in \mathbb{R}^n$, the RBF kernel function for $x_i$ and $x_j$ is defined as follows:

$$K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2),$$

where $\gamma > 0$ is a kernel parameter.
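For concreteness, a direct numpy rendering of this kernel can be given; the example vectors below are arbitrary illustrations:

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

x_i = np.array([1.0, 0.0, 1.0])  # e.g., binary n-gram feature vectors
x_j = np.array([0.0, 1.0, 1.0])
print(rbf_kernel(x_i, x_j, gamma=0.1))  # exp(-0.2) ~= 0.819
```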
We split the data into two subsets, for training and testing, with a 3:1 ratio, and we perform grid search with 3-fold cross-validation over the training set in order to optimize the hyperparameters c and γ, searching over fixed ranges of candidate values for each. The values which optimize accuracy on the training set are reported, for each pair of languages, in Table 3.
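Our experiments are run in Weka; purely as an illustration of this protocol, a rough equivalent can be sketched with scikit-learn standing in for the Weka/LibSVM setup (the candidate grids below are placeholders, not the grids used in our experiments):

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def train_cognate_classifier(X, y):
    """X: binary matrix of alignment n-gram features; y: 1 for cognate
    pairs, 0 for non-cognates (both assumed to be built already)."""
    # 3:1 train/test split, as in the experiments described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    # Grid search with 3-fold cross-validation over c and gamma.
    grid = {'C': [1, 2, 4, 10], 'gamma': [0.001, 0.01, 0.1]}
    search = GridSearchCV(SVC(kernel='rbf'), grid, cv=3, scoring='accuracy')
    search.fit(X_train, y_train)
    return search.best_params_, search.best_estimator_.score(X_test, y_test)
```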
We apply our method on an automatically extracted dataset of cognates for four pairs of languages: Romanian-French, Romanian-Italian, Romanian-Spanish and Romanian-Portuguese. In order to build the dataset, we apply the methodology proposed by Ciobanu and Dinu (2014) to the DexOnline machine-readable dictionary for Romanian (http://dexonline.ro). We discard pairs of words whose forms are identical across languages (e.g., the Romanian word matrice and its Italian cognate pair matrice, which have the same form), because such pairs provide no orthographic changes to be learned. For each pair of languages we determine a number of non-cognate pairs equal to the number of cognate pairs. Finally, we obtain 445 pairs of cognates for Romanian-French, 3,477 for Romanian-Italian, 5,113 for Romanian-Spanish and 7,858 for Romanian-Portuguese. (The number of cognate pairs is much lower for French than for the other languages because numerous Romanian words have French etymology, and in this paper we do not consider such words to be cognate candidates.) Because we need sets of approximately equal size for comparison across languages, we keep 400 pairs of cognates and 400 pairs of non-cognates for each pair of languages. In Tables 1 and 2 we provide, for each pair of languages, the five most relevant 2-gram orthographic changes, determined using the χ² distribution implemented in Weka, and the five most frequent 2-gram orthographic changes in the cognate pairs from our dataset. (For brevity, we use the ISO 639-1 codes in the tables to abbreviate languages, and we denote each pair of languages by the target language, given that Romanian is always the source language in our experiments.) None of the top-ranked orthographic cues occurs at the beginning of a word, while many of them occur at the end. The most frequent operation in Tables 1 and 2 is substitution.
Table 1: The five most relevant 2-gram orthographic changes for each pair of languages.

|    | 1st   | 2nd   | 3rd   | 4th   | 5th   |
|----|-------|-------|-------|-------|-------|
| IT | iu>io | un>on | l->le | t$>-$ | -$>e$ |
| FR | un>on | ne>n- | iu>io | ţi>ti | e$>-$ |
| ES | -$>o$ | ţi>ci | -->ón | ie>ió | at>ad |
| PT | ie>ão | aţ>aç | ţi>çã | i$>-$ | ă$>a$ |
Table 2: The five most frequent 2-gram orthographic changes in the cognate pairs from our dataset.

|    | 1st   | 2nd   | 3rd   | 4th   | 5th   |
|----|-------|-------|-------|-------|-------|
| IT | -$>e$ | -$>o$ | ă$>a$ | -->re | ţi>zi |
| FR | e$>-$ | un>on | ne>n- | iu>io | ţi>ti |
| ES | -$>o$ | e$>-$ | ţi>ci | ă$>a$ | at>ad |
| PT | -$>o$ | ă$>a$ | e$>-$ | -$>r$ | -$>a$ |
We propose a method for the automatic detection of cognate pairs using orthographic alignment. We experiment with two machine learning approaches: Naive Bayes and SVM. In Table 3 we report the results of our experiments, together with the n-gram values for which the best results are obtained and the SVM hyperparameters c and γ. The best results are obtained for French and Spanish, while the lowest accuracy is obtained for Portuguese. The SVM produces better results for all languages except Portuguese, for which the accuracy is equal. For Portuguese, both Naive Bayes and SVM misclassify more non-cognates as cognates than vice versa. A possible explanation might be the occurrence, in the dataset, of more remotely related words which are not labeled as cognates. We plan to investigate this assumption and to apply the proposed method on other datasets in our future work.
Table 3: Results of our method. P = precision, R = recall, A = accuracy (%), n = the n-gram length; c and γ are the SVM hyperparameters.

|    | Naive Bayes |      |      |   | SVM  |      |      |   |    |      |
|----|------|------|------|---|------|------|------|---|----|------|
|    | P    | R    | A    | n | P    | R    | A    | n | c  | γ    |
| IT | 0.72 | 0.93 | 79.0 | 1 | 0.76 | 0.92 | 81.5 | 1 | 1  | 0.10 |
| FR | 0.81 | 0.91 | 82.0 | 2 | 0.84 | 0.89 | 87.0 | 2 | 10 | 0.01 |
| ES | 0.79 | 0.92 | 84.0 | 1 | 0.85 | 0.88 | 86.5 | 2 | 4  | 0.01 |
| PT | 0.67 | 0.88 | 73.0 | 2 | 0.70 | 0.78 | 73.0 | 2 | 10 | 0.01 |
We investigate the performance of the method we propose in comparison to previous approaches for the automatic detection of cognate pairs based on orthographic similarity. We employ several orthographic metrics widely used in this research area: the edit distance [22], the longest common subsequence ratio [24] and the XDice metric [3]. We use normalized similarity metrics; for the edit distance, we subtract the normalized value from 1 in order to obtain a similarity. In addition, we use SpSim [11], which outperformed the longest common subsequence ratio and a similarity measure based on the edit distance in previous experiments. To evaluate these metrics on our dataset, we use the same train/test sets as in our previous experiments and we follow the strategy described in [17]. First, we compute the pairwise distances between pairs of words for each orthographic metric individually, as a single feature. (SpSim cannot be computed directly, as the other metrics can, so we introduce an additional step in which we use 1/3 of the training set, only cognates being needed, to learn orthographic changes; in order to maintain a stratified dataset, we discard an equal number of non-cognates from the training set, then compute the distances for the rest of the training set and for the test set, and use the remainder of the initial training set for the next step of the procedure.) In order to detect the best threshold for discriminating between cognates and non-cognates, we run a decision stump classifier (provided by Weka) on the training set for each pair of languages and for each metric. A decision stump is a decision tree classifier with only one internal node and two leaves corresponding to our two class labels. Using the best threshold value selected for each metric and pair of languages, we then classify the pairs of words in our test sets as cognates or non-cognates. In Table 4 we report the results for each approach. Our method performs better than the orthographic metrics considered as individual features. Out of the four similarity metrics, SpSim obtains, overall, the best performance. These results support the relevance of accounting for orthographic cues in cognate identification.
Table 4: Results for the orthographic similarity metrics. P = precision, R = recall, A = accuracy (%), t = the selected threshold.

|    | EDIT |      |      |      | LCSR |      |      |      | XDICE |      |      |      | SPSIM |      |      |      |
|----|------|------|------|------|------|------|------|------|-------|------|------|------|-------|------|------|------|
|    | P    | R    | A    | t    | P    | R    | A    | t    | P     | R    | A    | t    | P     | R    | A    | t    |
| IT | 0.67 | 0.97 | 75.0 | 0.43 | 0.68 | 0.91 | 75.0 | 0.51 | 0.66  | 0.98 | 74.0 | 0.21 | 0.66  | 0.98 | 74.5 | 0.44 |
| FR | 0.76 | 0.93 | 82.0 | 0.30 | 0.76 | 0.90 | 81.5 | 0.42 | 0.77  | 0.79 | 78.0 | 0.26 | 0.86  | 0.83 | 85.0 | 0.59 |
| ES | 0.77 | 0.91 | 82.0 | 0.56 | 0.72 | 0.97 | 80.0 | 0.47 | 0.72  | 0.99 | 80.5 | 0.19 | 0.81  | 0.90 | 85.0 | 0.64 |
| PT | 0.62 | 0.99 | 69.5 | 0.34 | 0.59 | 0.99 | 65.5 | 0.34 | 0.57  | 0.99 | 63.5 | 0.10 | 0.62  | 0.97 | 69.0 | 0.39 |
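As an illustration of this baseline, the edit-distance variant can be sketched as follows; the similarity helper and the stump-style threshold search are minimal stand-ins for the Weka components described above:

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    # Subtract the normalized distance from 1 to obtain a similarity.
    return 1 - edit_distance(a, b) / max(len(a), len(b))

def fit_threshold(pairs, labels):
    """Mimic a one-node decision stump over a single similarity feature:
    try each observed similarity as a candidate threshold and keep the
    one with the highest training accuracy."""
    sims = [edit_similarity(a, b) for a, b in pairs]
    best_t, best_acc = 0.0, 0.0
    for t in sorted(set(sims)):
        acc = sum((s >= t) == l for s, l in zip(sims, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```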
In this paper we proposed a method for the automatic detection of cognates based on orthographic alignment. We employed the Needleman-Wunsch algorithm [29], a sequence alignment algorithm widely used in computational biology, and we used aligned pairs of words to extract rules for the lexical changes occurring when words enter new languages. We applied our method on an automatically extracted dataset of cognates for four pairs of languages.
As future work, we plan to extend our method in several directions. In this paper we used a very simple substitution matrix for the alignment algorithm, but the method can be adapted to integrate historical information regarding language evolution: the substitution matrix can be customized with language-specific information, in order to reflect the probability of one character changing into another. An important step in this direction was taken by Delmestri and Cristianini (2010), who introduced PAM-like matrices: linguistically-inspired substitution matrices based on information regarding orthographic changes. We plan to investigate the contribution of this type of substitution matrix to our method.
We intend to investigate other approaches to string alignment, such as local alignment [33], and other learning algorithms for discriminating between cognates and non-cognates. We also plan to extend our analysis with more language-specific features, where linguistic knowledge is available. First, we intend to use the part of speech as an additional feature, as we assume that some orthographic changes depend on the part of speech of the words. Second, we want to investigate whether accounting for the common ancestor language influences the results; we are interested in finding out whether the orthographic rules depend on the source language or whether they are rather specific to the target language. Finally, we plan to make a performance comparison on cognate pairs versus word-etymon pairs and to investigate false friends [27].
We further intend to adapt our method for cognate detection to a closely related task, namely cognate production, i.e., given an input word w, a related language L and a set of learned rules for orthographic changes, to produce the cognate pair of w in L.
We thank the anonymous reviewers for their helpful and constructive comments. The contribution of the authors to this paper is equal. Research supported by CNCS UEFISCDI, project number PN-II-ID-PCE-2011-3-0959.