Automatic Detection of Cognates Using Orthographic Alignment

Alina Maria Ciobanu, Liviu P. Dinu
Faculty of Mathematics and Computer Science, University of Bucharest
Center for Computational Linguistics, University of Bucharest
alina.ciobanu@my.fmi.unibuc.ro, ldinu@fmi.unibuc.ro
Abstract

Words undergo various changes when entering new languages. Based on the assumption that these linguistic changes follow certain rules, we propose a method for automatically detecting pairs of cognates employing an orthographic alignment method which proved relevant for sequence alignment in computational biology. We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Given a list of known cognates, our approach does not require any other linguistic information. However, it can be customized to integrate historical information regarding language evolution.

1 Introduction

Cognates are words in different languages having the same etymology and a common ancestor. Investigating pairs of cognates is very useful in historical and comparative linguistics, in the study of language relatedness [30], in phylogenetic inference [1] and in identifying how and to what extent languages change over time. In several other research areas, such as language acquisition, bilingual word recognition [9], corpus linguistics [32], cross-lingual information retrieval [4] and machine translation [19], the condition of common etymology is usually not essential, and cognates are regarded as words with similar meanings and high orthographic or phonetic similarity across languages.

The wide range of applications in which cognates prove useful has attracted increasing attention to methods for detecting such related pairs of words. The task is most challenging for resource-poor languages, for which etymological information is not available. Therefore, research [17, 25, 16] has focused on the automatic identification of cognate pairs, starting from lists of known cognates.

In this paper, we propose a method for automatically determining pairs of cognates across languages. The proposed method requires a list of known cognates and, for languages for which additional linguistic information is available, it can be customized to integrate historical information regarding the evolution of the language. The rest of the paper is organized as follows: in Section 2 we present and analyze alternative methods and related work in this area. In Section 3 we introduce our approach for detection of cognates using orthographic alignment. In Section 4 we describe the experiments we conduct and we report and analyze the results, together with a comparison with previous methods. Finally, in Section 5 we draw the conclusions of our study and describe our plans for extending the method.

2 Related Work

There are three important aspects widely investigated in the task of cognate identification: semantic, phonetic and orthographic similarity. They were employed both individually [32, 17, 6] and combined [21, 34] in order to detect pairs of cognates across languages. For determining semantic similarity, external lexical resources, such as WordNet [10], or large corpora, might be necessary. For measuring phonetic and orthographic proximity of cognate candidates, string similarity metrics can be applied, using the phonetic or orthographic word forms as input. Various measures were investigated and compared [17, 14]; Levenshtein distance [22], XDice [3] and the longest common subsequence ratio [24] are among the most frequently used metrics in this field. Gomes and Lopes (2011) proposed SpSim, a more complex method for computing the similarity of cognate pairs which tolerates learned transitions between words.

Algorithms for string alignment were successfully used for identifying cognates based on both their forms, orthographic and phonetic. Delmestri and Cristianini (2010) used basic sequence alignment algorithms [29, 33, 12] to obtain orthographic alignment scores for cognate candidates. Kondrak (2000) developed the ALINE system, which aligns words’ phonetic transcriptions based on multiple phonetic features and computes similarity scores using dynamic programming. List (2012) proposed a framework for automatic detection of cognate pairs, LexStat, which combines different approaches to sequence comparison and alignment derived from those used in historical linguistics and evolutionary biology.

The changes undergone by words when passing from one language to another, and the transformation rules they follow, have been successfully employed in various approaches to cognate detection [18, 25, 28]. These orthographic changes have also been used in cognate production, a task closely related to cognate detection but not yet as intensively studied. While the purpose of cognate detection is to determine whether two given words form a cognate pair, the aim of cognate production is, given a word in a source language, to automatically produce its cognate pair in a target language. Beinborn et al. (2013) proposed a method for cognate production relying on statistical character-based machine translation, learning orthographic production patterns, and Mulloni (2007) introduced an algorithm for cognate production based on edit distance alignment and the identification of orthographic cues when words enter a new language.

3 Our Approach

Although there are multiple aspects that are relevant in the study of language relatedness, such as orthographic, phonetic, syntactic and semantic differences, in this paper we focus only on lexical evidence. The orthographic approach relies on the idea that sound changes leave traces in the orthography and that alphabetic character correspondences represent, to a fairly large extent, sound correspondences [8].

Words undergo various changes when entering new languages. We assume that rules for adapting foreign words to the orthographic system of the target languages might not have been very well defined in their period of early development, but they may have since become complex and probably language-specific. Detecting pairs of cognates based on etymology is useful and reliable, but, for resource-poor languages, methods which require less linguistic knowledge might be necessary. According to Gusfield (1997), an edit transcript (representing the conversion of one string to another) and an alignment are mathematically equivalent ways of describing relationships between strings. Therefore, because the edit distance was widely used in this research area and produced good results, we are encouraged to employ orthographic alignment for identifying pairs of cognates, not only to compute similarity scores, as was previously done, but to use aligned subsequences as features for machine learning algorithms. Our intuition is that inferring language-specific rules for aligning words will lead to better performance in the task of cognate identification.

3.1 Orthographic Alignment

String alignment is closely related to the task of sequence alignment in computational biology. Therefore, to align pairs of words we employ the Needleman-Wunsch global alignment algorithm [29], which is mainly used for aligning sequences of proteins or nucleotides. Global sequence alignment aims at determining the best alignment over the entire length of the input sequences. The algorithm uses dynamic programming and is thus guaranteed to find the optimal alignment. Its main idea is that any partial subpath ending at a point along the optimal path must itself be the optimal path leading up to that point, so the optimal path can be determined by incremental extension of optimal subpaths [31]. For orthographic alignment, we consider words as input sequences and we use a very simple substitution matrix, which gives equal scores to all substitutions, disregarding diacritics (e.g., we ensure that e and è are matched).
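To make this step concrete, the following is a minimal sketch of Needleman-Wunsch global alignment over two words, with a flat scoring scheme that treats diacritic variants (e.g., e and è) as matches; the scores, the strip_diacritics helper and the tie-breaking in the traceback are our own illustrative choices, not the exact settings used in the paper.

```python
import unicodedata

def strip_diacritics(ch):
    # Map a character to its base form (e.g. 'è' -> 'e') so that diacritic
    # variants are scored as matches.
    return unicodedata.normalize("NFD", ch)[0]

def needleman_wunsch(source, target, match=1, mismatch=-1, gap=-1):
    # Global alignment by dynamic programming with a flat substitution matrix.
    n, m = len(source), len(target)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    sub = lambda i, j: match if strip_diacritics(source[i]) == strip_diacritics(target[j]) else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i - 1, j - 1),  # (mis)match
                              score[i - 1][j] + gap,                    # deletion
                              score[i][j - 1] + gap)                    # insertion
    # Traceback: recover one optimal alignment as two gapped strings.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i - 1, j - 1):
            a.append(source[i - 1]); b.append(target[j - 1]); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(source[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(target[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

# One optimal global alignment of the example pair; ties between equally
# scoring alignments are broken arbitrarily, so the gap placement may differ
# slightly from the alignment shown in Section 3.2.
print(needleman_wunsch("exhaustiv", "esaustivo"))
```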

3.2 Feature Extraction

Using aligned pairs of words as input, we extract features around mismatches in the alignments. There are three types of mismatches, corresponding to the following operations: insertion, deletion and substitution. For example, for the Romanian word exhaustiv and its Italian cognate pair esaustivo, the alignment is as follows:

exhaustiv-
es-austivo

The first mismatch (between x and s) is caused by a substitution, the second mismatch (between h and -) is caused by a deletion from source language to target language, and the third mismatch (between - and o) is caused by an insertion from source language to target language. The features we use are character n-grams around mismatches. We experiment with two types of features:

  • i) n-grams around gaps, i.e., we account only for insertions and deletions;
  • ii) n-grams around any type of mismatch, i.e., we account for all three types of mismatches.

The second alternative leads to better performance, so we account for all mismatches. As for the length of the grams, we experiment with n ∈ {1, 2, 3}. We achieve slight improvements by combining n-grams, i.e., for a given n, we use all i-grams, where i ∈ {1, ..., n}. In order to provide information regarding the position of the features, we mark the beginning and the end of the word with a $ symbol. Thus, for the above-mentioned pair of cognates, (exhaustiv, esaustivo), we extract the following features when n = 2:

x>s, ex>es, xh>s-, h>-, xh>s-, ha>-a, ->o, v->vo, -$>o$

We count identical features only once. Therefore, because one feature (xh>s-) occurs twice in our example, we obtain 8 distinct features for the pair (exhaustiv, esaustivo).
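The feature extraction described above can be sketched as follows; the windowing rule (every i-gram that covers a mismatch column, for i = 1, ..., n) is inferred from the worked example, and the function name alignment_features is our own.

```python
def alignment_features(src_aligned, tgt_aligned, n=2):
    # src_aligned / tgt_aligned are equal-length gapped strings, e.g.
    # "exhaustiv-" and "es-austivo" produced by the alignment step.
    src = "$" + src_aligned + "$"   # mark the word boundaries
    tgt = "$" + tgt_aligned + "$"
    features = set()                # identical features are counted only once
    for col, (a, b) in enumerate(zip(src, tgt)):
        if a == b:
            continue                # only mismatch columns generate features
        for i in range(1, n + 1):   # combine 1-grams, ..., n-grams
            for start in range(col - i + 1, col + 1):
                if start < 0 or start + i > len(src):
                    continue
                features.add(src[start:start + i] + ">" + tgt[start:start + i])
    return features

# Yields the 8 distinct features listed above for (exhaustiv, esaustivo), n = 2.
print(sorted(alignment_features("exhaustiv-", "es-austivo")))
```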

3.3 Learning Algorithms

We use Naive Bayes as a baseline and we experiment with Support Vector Machines (SVMs) to learn orthographic changes and to discriminate between pairs of cognates and non-cognates. We put our system together using the Weka workbench [15], a suite of machine learning algorithms and tools. For SVM, we use the wrapper provided by Weka for LibSVM [5]. We use the radial basis function (RBF) kernel, which can handle the case when the relation between class labels and attributes is non-linear, as it maps samples non-linearly into a higher dimensional space. Given two instances x_i and x_j, where x_i, x_j ∈ ℝ^n, the RBF kernel function is defined as follows:

K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0,

where γ is a kernel parameter.

We split the data into two subsets, for training and testing, with a 3:1 ratio, and we perform grid search with 3-fold cross-validation over the training set in order to optimize the hyperparameters c and γ. We search over {1, 2, ..., 10} for c and over {10^-5, 10^-4, ..., 10^4, 10^5} for γ. The values which optimize accuracy on the training set are reported, for each pair of languages, in Table 3.
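A rough reproduction of this setup is sketched below, using scikit-learn in place of the Weka wrapper for LibSVM used in the paper; the binary encoding of the n-gram features and all names (vectorize, pairs, and the alignment_features helper from the Section 3.2 sketch) are our own assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def vectorize(pairs, n=2):
    # Binary presence of each aligned n-gram feature, using alignment_features
    # from the Section 3.2 sketch above.
    dicts = [{f: 1 for f in alignment_features(src, tgt, n)} for src, tgt in pairs]
    return DictVectorizer().fit_transform(dicts)

param_grid = {
    "C": list(range(1, 11)),                     # c in {1, 2, ..., 10}
    "gamma": [10.0 ** k for k in range(-5, 6)],  # gamma in {10^-5, ..., 10^5}
}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")

# Typical use, with a 3:1 train/test split prepared beforehand
# (e.g. via sklearn.model_selection.train_test_split):
#   grid.fit(X_train, y_train)
#   print(grid.best_params_, grid.score(X_test, y_test))
```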

4 Experiments

4.1 Data

We apply our method on an automatically extracted dataset of cognates for four pairs of languages: Romanian-French, Romanian-Italian, Romanian-Spanish and Romanian-Portuguese. In order to build the dataset, we apply the methodology proposed by Ciobanu and Dinu (2014) on the DexOnline machine-readable dictionary for Romanian (http://dexonline.ro). We discard pairs of words for which the forms across languages are identical (e.g., the Romanian word matrice and its Italian cognate pair matrice, having the same form), because these pairs do not provide any orthographic changes to be learned. For each pair of languages we determine a number of non-cognate pairs equal to the number of cognate pairs. Finally, we obtain 445 pairs of cognates for Romanian-French (the number of cognate pairs is much lower for French than for the other languages because there are numerous Romanian words with French etymology, which we do not consider cognate candidates in this paper), 3,477 for Romanian-Italian, 5,113 for Romanian-Spanish and 7,858 for Romanian-Portuguese. Because we need sets of approximately equal size for comparison across languages, we keep 400 pairs of cognates and 400 pairs of non-cognates for each pair of languages. In Tables 1 and 2 we provide, for each pair of languages, the five most relevant 2-gram orthographic changes, determined using the χ² attribute evaluation method implemented in Weka, and the five most frequent 2-gram orthographic changes in the cognate pairs from our dataset (for brevity, we use the ISO 639-1 language codes in the tables and denote each pair of languages by the target language, since Romanian is always the source language in our experiments). None of the top ranked orthographic cues occurs at the beginning of the word, while many of them occur at the end of the word. The most frequent operation in Tables 1 and 2 is substitution.

       1st      2nd      3rd      4th      5th
IT     iu>io    un>on    l->le    t$>-$    -$>e$
FR     un>on    ne>n-    iu>io    ţi>ti    e$>-$
ES     -$>o$    ţi>ci    >ón      ie>      at>ad
PT     ie>ão    >        ţi>çã    i$>-$    ă$>a$

Table 1: The most relevant orthographic cues for each pair of languages, determined on the entire datasets using the χ² attribute evaluation method implemented in Weka.
       1st      2nd      3rd      4th      5th
IT     -$>e$    -$>o$    ă$>a$    >re      ţi>zi
FR     e$>-$    un>on    ne>n-    iu>io    ţi>ti
ES     -$>o$    e$>-$    ţi>ci    ă$>a$    at>ad
PT     -$>o$    ă$>a$    e$>-$    -$>r$    -$>a$

Table 2: The most frequent orthographic cues for each pair of languages, determined on the cognate lists using the raw frequencies.
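Returning to the dataset construction described at the beginning of this subsection, a minimal sketch of the filtering and balancing steps might look as follows; the sampling policy and all names are our own assumptions, not the exact procedure used to build the dataset.

```python
import random

def build_balanced_dataset(cognate_pairs, non_cognate_pairs, size=400, seed=0):
    # Discard pairs whose forms are identical across languages (no orthographic
    # change to learn), then keep an equal number of cognates and non-cognates.
    cognates = [(a, b) for a, b in cognate_pairs if a != b]
    rng = random.Random(seed)
    sample = [(p, 1) for p in rng.sample(cognates, size)] + \
             [(p, 0) for p in rng.sample(list(non_cognate_pairs), size)]
    rng.shuffle(sample)
    return sample  # list of ((source_word, target_word), label) tuples
```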

4.2 Results Analysis

We propose a method for automatic detection of cognate pairs using orthographic alignment and experiment with two machine learning approaches: Naive Bayes and SVM. In Table 3 we report the results, including the n-gram values for which the best results are obtained and the SVM hyperparameters c and γ. The best results are obtained for French and Spanish, while the lowest accuracy is obtained for Portuguese. The SVM produces better results for all languages except Portuguese, where the accuracy is equal. For Portuguese, both Naive Bayes and SVM misclassify more non-cognates as cognates than vice versa. A possible explanation might be the occurrence, in the dataset, of more remotely related words which are not labeled as cognates. We plan to investigate this assumption and to apply the proposed method on other datasets in future work.

      Naive Bayes               SVM
      P     R     A     n       P     R     A     n     c     γ
IT    0.72  0.93  79.0  1       0.76  0.92  81.5  1     1     0.10
FR    0.81  0.91  82.0  2       0.84  0.89  87.0  2     10    0.01
ES    0.79  0.92  84.0  1       0.85  0.88  86.5  2     4     0.01
PT    0.67  0.88  73.0  2       0.70  0.78  73.0  2     10    0.01

Table 3: Results for automatic detection of cognates using orthographic alignment. We report the precision (P), recall (R) and accuracy (A) obtained on the test sets and the optimal n-gram values. For SVM we also report the optimal hyperparameters c and γ obtained during cross-validation on the training sets.

4.3 Comparison with Previous Methods

We investigate the performance of the proposed method in comparison to previous approaches for automatic detection of cognate pairs based on orthographic similarity. We employ several orthographic metrics widely used in this research area: the edit distance [22], the longest common subsequence ratio [24] and the XDice metric [3]. We use normalized similarity metrics; for the edit distance, we subtract the normalized value from 1 in order to obtain a similarity. In addition, we use SpSim [11], which outperformed the longest common subsequence ratio and an edit-distance-based similarity measure in previous experiments. To evaluate these metrics on our dataset, we use the same train/test sets as in our previous experiments and we follow the strategy described in [17]. First, for each orthographic metric individually, we compute the similarity between the words of each candidate pair and use it as a single feature. (Unlike the other metrics, SpSim cannot be computed directly, so we introduce an additional step in which we use 1/3 of the training set, only cognates being needed, to learn orthographic changes. In order to maintain a stratified dataset, we discard an equal number of non-cognates from the training set; we then compute the similarities for the rest of the training set and for the test set, and we use the remainder of the initial training set for the next step of the procedure.) In order to detect the best threshold for discriminating between cognates and non-cognates, we run a decision stump classifier (provided by Weka) on the training set for each pair of languages and for each metric. A decision stump is a decision tree classifier with only one internal node and two leaves corresponding to our two class labels. Using the best threshold value selected for each metric and pair of languages, we classify the pairs of words in our test sets as cognates or non-cognates. In Table 4 we report the results for each approach. Our method performs better than the orthographic metrics used as individual features. Out of the four similarity metrics, SpSim obtains, overall, the best performance. These results support the relevance of accounting for orthographic cues in cognate identification.

      EDIT                        LCSR                        XDICE                       SPSIM
      P     R     A     t         P     R     A     t         P     R     A     t         P     R     A     t
IT    0.67  0.97  75.0  0.43      0.68  0.91  75.0  0.51      0.66  0.98  74.0  0.21      0.66  0.98  74.5  0.44
FR    0.76  0.93  82.0  0.30      0.76  0.90  81.5  0.42      0.77  0.79  78.0  0.26      0.86  0.83  85.0  0.59
ES    0.77  0.91  82.0  0.56      0.72  0.97  80.0  0.47      0.72  0.99  80.5  0.19      0.81  0.90  85.0  0.64
PT    0.62  0.99  69.5  0.34      0.59  0.99  65.5  0.34      0.57  0.99  63.5  0.10      0.62  0.97  69.0  0.39

Table 4: Comparison with previous methods for automatic detection of cognate pairs based on orthography. We report the precision (P), recall (R) and accuracy (A) obtained on the test sets and the optimal threshold t for discriminating between cognates and non-cognates.
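For readers who want to reproduce these baselines, the sketch below gives straightforward implementations of the normalized edit-distance similarity and LCSR, together with a brute-force threshold search standing in for Weka's decision stump; XDice and SpSim are omitted, and the thresholds found this way will not necessarily coincide with those in Table 4.

```python
def edit_similarity(a, b):
    # 1 minus the normalized Levenshtein distance.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return 1 - d[m][n] / max(m, n)

def lcsr(a, b):
    # Longest common subsequence ratio.
    m, n = len(a), len(b)
    l = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            l[i][j] = l[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1] \
                      else max(l[i - 1][j], l[i][j - 1])
    return l[m][n] / max(m, n)

def best_threshold(scores, labels):
    # Brute-force stand-in for a one-level decision stump: pick the cut-off
    # that maximizes training accuracy for the rule "similarity >= t => cognate".
    best_t, best_acc = 0.0, 0.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Hypothetical usage on (word, word) training pairs with 0/1 labels:
#   t = best_threshold([lcsr(a, b) for a, b in train_pairs], train_labels)
#   predictions = [lcsr(a, b) >= t for a, b in test_pairs]
```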

5 Conclusions and Future Work

In this paper we proposed a method for automatic detection of cognates based on orthographic alignment. We employed the Needleman-Wunsch sequence alignment algorithm [29], widely used in computational biology, and we used aligned pairs of words to extract rules for the lexical changes occurring when words enter new languages. We applied our method on an automatically extracted dataset of cognates for four pairs of languages.

As future work, we plan to extend our method in several directions. In this paper we used a very simple substitution matrix for the alignment algorithm, but the method can be adapted to integrate historical information regarding language evolution. The substitution matrix can be customized with language-specific information, in order to reflect the probability of one character changing into another. An important achievement in this direction belongs to Delmestri and Cristianini (2010), who introduced PAM-like matrices, linguistically inspired substitution matrices based on information regarding orthographic changes. We plan to investigate the contribution of this type of substitution matrix to our method.

We intend to investigate other approaches to string alignment, such as local alignment [33], and other learning algorithms for discriminating between cognates and non-cognates. We also plan to extend our analysis with more language-specific features where linguistic knowledge is available. First, we intend to use the part of speech as an additional feature, since we assume that some orthographic changes depend on the part of speech of the words. Second, we want to investigate whether accounting for the common ancestor language influences the results; we are interested in finding out whether the orthographic rules depend on the source language or are rather specific to the target language. Finally, we plan to compare performance on cognate pairs versus word-etymon pairs and to investigate false friends [27].

We further intend to adapt our method for cognate detection to a closely related task, namely cognate production, i.e., given an input word w, a related language L and a set of learned rules for orthographic changes, to produce the cognate pair of w in L.

Acknowledgements

We thank the anonymous reviewers for their helpful and constructive comments. The contribution of the authors to this paper is equal. Research supported by CNCS UEFISCDI, project number PN-II-ID-PCE-2011-3-0959.

References

  • [1] Q. D. Atkinson, R. D. Gray, G. K. Nicholls and D. J. Welch (2005) From Words to Dates: Water into Wine, Mathemagic or Phylogenetic Inference?. Transactions of the Philological Society 103, pp. 193–219.
  • [2] L. Beinborn, T. Zesch and I. Gurevych (2013) Cognate Production Using Character-based Machine Translation. pp. 883–891.
  • [3] C. Brew and D. McKelvie (1996) Word-Pair Extraction for Lexicography. pp. 45–55.
  • [4] C. Buckley, M. Mitra, J. A. Walz and C. Cardie (1997) Using Clustering and SuperConcepts Within SMART: TREC 6. pp. 107–124.
  • [5] C. Chang and C. Lin (2011) LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology 2, pp. 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • [6] K. W. Church (1993) Char align: A Program for Aligning Parallel Texts at the Character Level. pp. 1–8.
  • [7] A. M. Ciobanu and L. P. Dinu (2014) Building a Dataset of Multilingual Cognates for the Romanian Lexicon.
  • [8] A. Delmestri and N. Cristianini (2010) String Similarity Measures and PAM-like Matrices for Cognate Identification. Bucharest Working Papers in Linguistics 12 (2), pp. 71–82.
  • [9] T. Dijkstra, F. Grootjen and J. Schepens (2012) Distributions of Cognates in Europe as Based on Levenshtein Distance. Bilingualism: Language and Cognition 15, pp. 157–166.
  • [10] C. Fellbaum (Ed.) (1998) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts.
  • [11] L. Gomes and J. G. P. Lopes (2011) Measuring Spelling Similarity for Cognate Identification. pp. 624–633. Software available at http://research.variancia.com/spsim.
  • [12] O. Gotoh (1982) An Improved Algorithm for Matching Biological Sequences. Journal of Molecular Biology 162 (3), pp. 705–708.
  • [13] D. Gusfield (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA.
  • [14] D. Hall and D. Klein (2010) Finding Cognate Groups Using Phylogenies. pp. 1030–1039.
  • [15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten (2009) The WEKA Data Mining Software: An Update. SIGKDD Explorations 11 (1), pp. 10–18. Software available at http://www.cs.waikato.ac.nz/ml/weka.
  • [16] B. Hauer and G. Kondrak (2011) Clustering Semantically Equivalent Words into Cognate Sets in Multilingual Lists. pp. 865–873.
  • [17] D. Inkpen, O. Frunza and G. Kondrak (2005) Automatic Identification of Cognates and False Friends in French and English. pp. 251–257.
  • [18] P. Koehn and K. Knight (2000) Estimating Word Translation Probabilities from Unrelated Monolingual Corpora Using the EM Algorithm. pp. 711–715.
  • [19] G. Kondrak, D. Marcu and K. Knight (2003) Cognates Can Improve Statistical Translation Models. pp. 46–48.
  • [20] G. Kondrak (2000) A New Algorithm for the Alignment of Phonetic Sequences. pp. 288–295.
  • [21] G. Kondrak (2004) Combining Evidence in Cognate Identification. pp. 44–59.
  • [22] V. I. Levenshtein (1965) Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady 10, pp. 707–710.
  • [23] J. List (2012) LexStat: Automatic Detection of Cognates in Multilingual Wordlists. pp. 117–125.
  • [24] D. Melamed (1995) Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons.
  • [25] A. Mulloni and V. Pekar (2006) Automatic Detection of Orthographic Cues for Cognate Recognition. pp. 2387–2390.
  • [26] A. Mulloni (2007) Automatic Prediction of Cognate Orthography Using Support Vector Machines. pp. 25–30.
  • [27] S. Nakov, P. Nakov and E. Paskaleva (2007) Cognate or False Friend? Ask the Web!. pp. 55–62.
  • [28] M. Navlea and A. Todirascu (2011) Using Cognates in a French-Romanian Lexical Alignment System: A Comparative Study. pp. 247–253.
  • [29] S. B. Needleman and C. D. Wunsch (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48 (3), pp. 443–453.
  • [30] E. Ng, B. Chin, A. W. Yeo and B. Ranaivo-Malançon (2010) Identification of Closely-Related Indigenous Languages: An Orthographic Approach. International Journal of Asian Language Processing 20 (2), pp. 43–62.
  • [31] G. D. Schuler (2002) Sequence Alignment and Database Searching. In A. D. Baxevanis and B. F. F. Ouellette (Eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 43. John Wiley & Sons, Inc., New York, USA.
  • [32] M. Simard, G. F. Foster and P. Isabelle (1992) Using Cognates to Align Sentences in Bilingual Corpora.
  • [33] T. F. Smith and M. S. Waterman (1981) Identification of Common Molecular Subsequences. Journal of Molecular Biology 147 (1), pp. 195–197.
  • [34] L. Steiner, P. F. Stadler and M. Cysouw (2011) A Pipeline for Computational Historical Linguistics. Language Dynamics and Change 1 (1), pp. 89–127.