Lexical resource alignment has been an active field of research over the last decade. However, prior methods for aligning lexical resources have been either specific to a particular pair of resources, or heavily dependent on the availability of hand-crafted alignment data for the pair of resources to be aligned. Here we present a unified approach that can be applied to an arbitrary pair of lexical resources, including machine-readable dictionaries with no network structure. Our approach leverages a similarity measure that enables the structural comparison of senses across lexical resources, achieving state-of-the-art performance on the task of aligning WordNet to three different collaborative resources: Wikipedia, Wiktionary and OmegaWiki.
Lexical resources are repositories of machine-readable knowledge that can be used in virtually any Natural Language Processing task. Notable examples are WordNet, Wikipedia and, more recently, collaboratively-curated resources such as OmegaWiki and Wiktionary [13]. On the one hand, these resources are heterogeneous in design, structure and content, but, on the other hand, they often provide complementary knowledge which we would like to see integrated. Given the large scale this intrinsic issue can only be addressed automatically, by means of lexical resource alignment algorithms. Owing to its ability to bring together features like multilinguality and increasing coverage, over the past few years resource alignment has proven beneficial to a wide spectrum of tasks, such as Semantic Parsing [33], Semantic Role Labeling [28], and Word Sense Disambiguation [25].
Nevertheless, when it comes to aligning textual definitions in different resources, the lexical approach [32, 5, 11] falls short because of the potential use of totally different wordings to define the same concept. Deeper approaches leverage semantic similarity to go beyond the surface realization of definitions [26, 20, 27]. While providing good results in general, these approaches fail when the definitions of a given word are not of adequate quality and expressiveness to be distinguishable from one another. When a lexical resource can be viewed as a semantic graph, as with WordNet or Wikipedia, this limit can be overcome by means of alignment algorithms that exploit the network structure to determine the similarity of concept pairs. However, not all lexical resources provide explicit semantic relations between concepts and, hence, machine-readable dictionaries like Wiktionary have first to be transformed into semantic graphs before such graph-based approaches can be applied to them. To do this, recent work has proposed graph construction by monosemous linking, where a concept is linked to all the concepts associated with the monosemous words in its definition [17]. However, this alignment method still involves tuning of parameters which are highly dependent on the characteristics of the generated graphs and, hence, requires hand-crafted sense alignments for the specific pair of resources to be aligned, a task which has to be replicated every time the resources are updated.
In this paper we propose a unified approach to aligning arbitrary pairs of lexical resources which is independent of their specific structure. Thanks to a novel modeling of the sense entries and an effective ontologization algorithm, our approach also fares well when resources lack relational structure or pair-specific training data is absent, meaning that it is applicable to arbitrary pairs without adaptation. We report state-of-the-art performance when aligning WordNet to Wikipedia, OmegaWiki and Wiktionary.
Our approach for aligning lexical resources exploits the graph structure of each resource. Therefore, we assume that a lexical resource can be represented as an undirected graph where is the set of nodes, i.e., the concepts defined in the resource, and is the set of undirected edges, i.e., semantic relations between concepts. Each concept is associated with a set of lexicalizations . For instance, WordNet can be readily represented as an undirected graph whose nodes are synsets and edges are modeled after the relations between synsets defined in WordNet (e.g., hypernymy, meronymy, etc.), and is the mapping between each synset node and the set of synonyms which express the concept. However, other resources such as Wiktionary do not provide semantic relations between concepts and, therefore, have first to be transformed into semantic networks before they can be aligned using our alignment algorithm. We explain in Section 3 how a semi-structured resource which does not exhibit a graph structure can be transformed into a semantic network.
Given a pair of lexical resources and , we align each concept in by mapping it to its corresponding concept(s) in the target lexicon . Algorithm 2 formalizes the alignment process: the algorithm takes as input the semantic graphs and corresponding to the two resources, as explained above, and produces as output an alignment in the form of a set of concept pairs. The algorithm iterates over all concepts and, for each of them, obtains the set of concepts , which can be considered as alignment candidates for (line 2). For a concept , alignment candidates in usually consist of every concept that shares at least one lexicalization with in the same part of speech tag, i.e., [31, 20]. Once the set of target candidates for a source concept is obtained, the alignment task can be cast as that of identifying those concepts in to which should be aligned. To do this, the algorithm calculates the similarity between and each (line 2). If their similarity score exceeds a certain value denoted by (line 2), the two concepts and are aligned and the pair is added to (line 2).
Different resource alignment techniques usually vary in the way they compute the similarity of a pair of concepts across two resources (line 2 in Algorithm 2). In the following, we present our novel approach for measuring the similarity of concept pairs.
[t!] {algorithmic}[1] \REQUIREgraphs , and , the similarity threshold , and the combination parameter \ENSURE, the set of all aligned concept pairs
concept \STATE getCandidates \FORALLconcept \STATE calculateSimilarity \IF \STATE \ENDIF\ENDFOR
Figure 1 illustrates the procedure underlying our cross-resource concept similarity measurement technique. As can be seen, the approach consists of two main components: definitional similarity and structural similarity. Each of these components gets, as its input, a pair of concepts belonging to two different semantic networks and produces a similarity score. These two scores are then combined into an overall score (part (e) of Figure 1) which quantifies the semantic similarity of the two input concepts and .
The definitional similarity component computes the similarity of two concepts in terms of the similarity of their definitions, a method that has also been used in previous work for aligning lexical resources [27, 12]. In spite of its simplicity, the mere calculation of the similarity of concept definitions provides a strong baseline, especially for cases where the definitional texts for a pair of concepts to be aligned are lexically similar, yet distinguishable from the other definitions. However, as mentioned in the introduction, definition similarity-based techniques fail at identifying the correct alignments in cases where different wordings are used or definitions are not of high quality. The structural similarity component, instead, is a novel graph-based similarity measurement technique which calculates the similarity between a pair of concepts across the semantic networks of the two resources by leveraging the semantic structure of those networks. This component goes beyond the surface realization of concepts, thus providing a deeper measure of concept similarity.
The two components share the same backbone (parts (b) and (d) of Figure 1), but differ in some stages (parts (a) and (c) in Figure 1). In the following, we explain all the stages involved in the two components (gray blocks in the figure).
The aim of this stage is to model a given concept or set of concepts through a vectorial semantic representation, which we refer to as the semantic signature of the input. We utilized Personalized PageRank [10, ppr], a random walk graph algorithm, for calculating semantic signatures. The original PageRank (pr) algorithm [3] computes, for a given graph, a single vector wherein each node is associated with a weight denoting its structural importance in that graph. ppr is a variation of pr where the computation is biased towards a set of initial nodes in order to capture the notion of importance with respect to those particular nodes. ppr has been previously used in a wide variety of tasks such as definition similarity-based resource alignment [27], textual semantic similarity [14, 30], Word Sense Disambiguation [1, 6] and semantic text categorization [24]. When applied to a semantic graph by initializing the random walks from a set of concepts (nodes), ppr yields a vector in which each concept is associated with a weight denoting its semantic relevance to the initial concepts.
Formally, we first represent a semantic network consisting of concepts as a row-stochastic transition matrix . The cell in the matrix denotes the probability of moving from a concept to in the graph: 0 if no edge exists from to and otherwise. Then the ppr vector, hence the semantic signature of vector is the unique solution to the linear system: , where is the personalization vector of size in which all the probability mass is put on the concepts for which a semantic signature is to be computed and is the damping factor, which is usually set to 0.85 [3]. We used the ukb11http://ixa2.si.ehu.es/ukb/ off-the-shelf implementation of ppr.
In the definitional similarity component, the two concepts and are first represented by their corresponding definitions and in the respective resources and (Figure 1(a), top). To improve expressiveness, we follow Niemann and Gurevych (2011) and further extend with all the word forms associated with concept and its neighbours, i.e., the union of all lexicalizations for all concepts , where is the set of edges in . In this component the personalization vector is set by uniformly distributing the probability mass over the nodes corresponding to the senses of all the content words in the extended definition of according to the sense inventory of a semantic network . We use the same semantic graph for computing the semantic signatures of both definitions. Any semantic network with a dense relational structure, providing good coverage of the words appearing in the definitions, is a suitable candidate for . For this purpose we used the WordNet [7] graph which was further enriched by connecting each concept to all the concepts appearing in its disambiguated gloss.22http://wordnet.princeton.edu
In the structural similarity component (Figure 1(b), bottom), the semantic signature for each concept is computed by running the ppr algorithm on its corresponding graph , hence a different is built for each of the two concepts.
As mentioned earlier, semantic signatures are vectors with dimension equal to the number of nodes in the semantic graph. Since the structural similarity signatures and are calculated on different graphs and thus have different dimensions, we need to make them comparable by unifying them. We therefore propose an approach (part (c) of Figure 1) that finds a common ground between the two signatures: to this end we consider all the concepts associated with monosemous words in the two signatures as landmarks and restrict the two signatures exclusively to those common concepts. Leveraging monosemous words as bridges between two signatures is a particularly reliable technique as typically a significant portion of all words in a lexicon are monosemous.33For instance, we calculated that more than 80% of the words in WordNet are monosemous, with over 60% of all the synsets containing at least one of them.
Formally, let be an inventory mapping function that maps a term to the set of concepts which are expressed by in graph . Then, given two signatures and , computed on the respective graphs and , we first obtain the set of words that are monosemous according to both semantic networks, i.e., . We then transform each of the two signatures into a new sub-signature whose dimension is : the component of corresponds to the weight in of the only concept of in . As an example, assume we are given two semantic signatures computed for two concepts in WordNet and Wiktionary. Also, consider the noun tradeoff which is monosemous according to both these resources. Then, each of the two unified sub-signatures will contain a component whose weight is determined by the weight of the only concept associated with tradeoff in the corresponding semantic signature. As a result of the unification process, we obtain a pair of equally-sized semantic signatures with comparable components.
Having at hand the semantic signatures for the two input concepts, we proceed to comparing them (part (d) in Figure 1). We leverage a non-parametric measure proposed by Pilehvar et al. (2013) which first transforms each signature into a list of sorted elements and then calculates the similarity on the basis of the average ranking of elements across the two lists:
(1) |
where is the intersection of all concepts with non-zero probability in the two signatures and is the rank of the entry in the sorted list. The denominator is a normalization factor to guarantee a maximum value of one. The method penalizes the differences in the higher rankings more than it does for the lower ones. The measure was shown to outperform the conventional cosine distance when comparing different semantic signatures in multiple textual similarity tasks [30].
Finally (part (e) of Figure 1), we calculate the overall similarity between two concepts as a linear combination of their definitional and structural similarities: In Section 4.2.1, we explain how we set, in our experiments, the values of and the similarity threshold (cf. alignment algorithm in Section 2).
In Section 2, we presented our approach for aligning lexical resources. However, the approach assumes that the input resources can be viewed as semantic networks, which seems to limit its applicability to structured resources only. In order to address this issue and hence generalize our alignment approach to any given lexical resource, we propose a method for transforming a given machine-readable dictionary into a semantic network, a process we refer to as ontologization.
Our ontologization algorithm takes as input a lexicon and outputs a semantic graph where, as already defined in Section 2, is the set of concepts in and is the set of semantic relations between these concepts. Introducing relational links into a lexicon can be achieved in different ways. A first option is to extract binary relations between pairs of words from raw text. Both words in these relations, however, should be disambiguated according to the given lexicon [29], making the task particularly prone to mistakes due to the high number of possible sense pairings.
Here, we take an alternative approach which requires disambiguation on the target side only, hence reducing the size of the search space significantly. We first create the empty undirected graph such that is the set of concepts in and . For each source concept we create a bag of content words which includes all the content words in its definition and, if available, additional related words obtained from lexicon relations (e.g., synonyms in Wiktionary). The problem is then cast as a disambiguation task whose goal is to identify the intended sense of each word according to the sense inventory of : if is monosemous, i.e., , we connect our source concept to the only sense of and set ; else, has multiple senses in . In this latter case, we choose the most appropriate concept by finding the maximal similarity between the definition of and the definitions of each sense of . To do this, we apply our definitional similarity measure introduced in Section 2.1. Having found the intended sense of , we add the edge to . As a result of this procedure, we obtain a semantic graph representation for the lexicon .
As an example, consider the 4 sense of the noun cone in Wiktionary (i.e., cone) which is defined as “The fruit of a conifer”. The definition contains two content words: fruit and conifer. The latter word is monosemous in Wiktionary, hence we directly connect cone to the only sense of conifer. The noun fruit, however, has 5 senses in Wiktionary. We therefore measure the similarity between the definition of cone and all the 5 definitions of fruit and introduce a link from cone to the sense of fruit which yields the maximal similarity value (defined as “(botany) The seed-bearing part of a plant…”).
To enable a comparison with the state of the art, we followed Matuschek and Gurevych (2013) and performed an alignment of WordNet synsets (wn) to three different collaboratively-constructed resources: Wikipedia (wp), Wiktionary (wt), and OmegaWiki (ow). We utilized the DKPro software [35, 9] to access the information in the foregoing three resources. For wp, wt, ow we used the dump versions , , and , respectively.
We followed previous work [25, 17] and evaluated the alignment performance in terms of four measures: precision, recall, F1, and accuracy. Precision is the fraction of correct alignment judgments returned by the system and recall is the fraction of alignment judgments in the gold standard dataset that are correctly returned by the system. F1 is the harmonic mean of precision and recall. We also report results for accuracy which, in addition to true positives, takes into account true negatives, i.e., pairs which are correctly judged as unaligned.
Here, we describe how the four semantic graphs for our four lexical resources (i.e., wn, wp, wt, ow) were constructed. As mentioned in Section 2.1.1, we build the wn graph by including all the synsets and semantic relations defined in WordNet (e.g., hypernymy and meronymy) and further populate the relation set by connecting a synset to all the other synsets that appear in its disambiguated gloss. For wp, we used the graph provided by Matuschek and Gurevych (2013), constructed by directly connecting an article (concept) to all the hyperlinks in its first paragraph, together with the category links. Our wn and wp graphs have 118K and 2.8M nodes, respectively, with the average node degree being roughly 9 in both resources.
The other two resources, i.e., wt and ow, do not provide a reliable network of semantic relations, therefore we used our ontologization approach to construct their corresponding semantic graphs. We report, in the following subsection, the experiments carried out to assess the accuracy of our ontologization method, together with the statistics of the obtained graphs for wt and ow.
For ontologizing wt and ow, the bag of content words is given by the content words in sense definitions and, if available, additional related words obtained from lexicon relations (see Section 3). In wt, both of these are in word surface form and hence had to be disambiguated. For ow, however, the encoded relations, though relatively small in number, are already disambiguated and, therefore, the ontologization was just performed on the definition’s content words.
The resulting graphs for wt and ow contain 430K and 48K nodes, respectively, each providing more than 95% coverage of concepts, with the average node degree being around 10 for both resources. We present in Table 1, for wt and ow, the total number of edges together with their distribution across types (i.e., ambiguous and unambiguous) and sources (i.e., definitions and relations) from which candidate words were obtained.
Source | Type | wt | ow |
Definition | Ambiguous | 76.6% | 50.7% |
Unambiguous | 18.3% | 32.9% | |
Relation | Ambiguous | 2.8% | - |
Unambiguous | 2.3% | 16.4% | |
Total number of edges | 2.1M | 255K | |
The edges obtained from unambiguous entries are essentially sense disambiguated on both sides whereas those obtained from ambiguous terms are a result of our similarity-based disambiguation. Hence, given that a large portion of edges came from ambiguous words (see Table 1), we carried out an experiment to evaluate the accuracy of our disambiguation method. To this end, we took as our benchmark the dataset provided by Meyer and Gurevych (2010) for evaluating relation disambiguation in wt. The dataset contains 394 manually-disambiguated relations. We compared our similarity-based disambiguation approach against the state of the art on this dataset, i.e., the wktwsd system, which is a wt relation disambiguation algorithm based on a series of rules [22].
Table 2 shows the performance of our disambiguation method, together with that of wktwsd, in terms of Precision (P), Recall (R), F1, and accuracy. The “Human” row corresponds to the inter-rater F1 and accuracy scores, i.e., the upperbound performance on this dataset, as calculated by Meyer and Gurevych (2010). As can be seen, our method proves to be very accurate, surpassing the performance of the wktwsd system in terms of precision, F1, and accuracy. This is particularly interesting as the wktwsd system uses a rule-based technique specific to relation disambiguation in wt, whereas our method is resource independent and can be applied to arbitrary words in the definition of any concept. We also note that the graph constructed by Meyer and Gurevych (2010) had an average node degree of around 1.
More recently, Matuschek and Gurevych (2013) leveraged monosemous linking (cf. Section 5) in order to create denser semantic graphs for ow and wt. Our approach, however, thanks to the connections obtained through ambiguous words, can provide graphs with significantly higher coverage. As an example, for wt, Matuschek and Gurevych (2013) generated a graph where around 30% of the nodes were in isolation, whereas this number drops to around 5% in our corresponding graph.
These results show that our ontologization approach can be used to obtain dense semantic graph representations of lexical resources, while at the same time preserving a high level of accuracy. Now that all the four resources are transformed into semantic graphs, we move to our alignment experiments.
Approach | P | R | F1 | A |
---|---|---|---|---|
wktwsd | 0.780 | 0.800 | 0.790 | 0.840 |
Our method | 0.852 | 0.767 | 0.807 | 0.857 |
Human | - | - | 0.890 | 0.910 |
Approach | Training type | wn-wp | wn-wt | wn-ow | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
P | R | F1 | A | P | R | F1 | A | P | R | F1 | A | ||
sb | Cross-val. | 0.780 | 0.780 | 0.780 | 0.950 | 0.670 | 0.650 | 0.660 | 0.910 | 0.749 | 0.691 | 0.716 | 0.886 |
dwsa | Tuning on subset | 0.750 | 0.670 | 0.710 | 0.930 | 0.680 | 0.270 | 0.390 | 0.890 | 0.651 | 0.372 | 0.473 | 0.830 |
sb+dwsa | Cross-val. + tuning | 0.750 | 0.870 | 0.810 | 0.950 | 0.680 | 0.710 | 0.690 | 0.920 | 0.794 | 0.688 | 0.735 | 0.898 |
SemAlign | Unsupervised | 0.709 | 0.929 | 0.805 | 0.943 | 0.642 | 0.799 | 0.712 | 0.923 | 0.664 | 0.761 | 0.709 | 0.872 |
Tuning on subset | 0.877 | 0.792 | 0.833 | 0.960 | 0.672 | 0.799 | 0.730 | 0.930 | 0.750 | 0.717 | 0.733 | 0.893 | |
Cross-val. | 0.852 | 0.835 | 0.840 | 0.965 | 0.680 | 0.769 | 0.722 | 0.931 | 0.778 | 0.725 | 0.749 | 0.900 | |
Tuning on wn-wp | - | - | - | - | 0.754 | 0.627 | 0.684 | 0.931 | 0.825 | 0.584 | 0.684 | 0.889 | |
Tuning on wn-wt | 0.738 | 0.934 | 0.824 | 0.950 | - | - | - | - | 0.805 | 0.677 | 0.736 | 0.900 | |
Tuning on wn-ow | 0.744 | 0.925 | 0.824 | 0.950 | 0.684 | 0.766 | 0.723 | 0.930 | - | - | - | - |
As our benchmark we tested on the gold standard datasets used in Matuschek and Gurevych (2013) for three alignment tasks: WordNet-Wikipedia (wn-wp), WordNet-Wiktionary (wn-wt), and WordNet-OmegaWiki (wn-ow). However, the dataset for wn-ow was originally built for the German language and, hence, was missing many English ow concepts that could be considered as candidate target alignments. We therefore fixed the dataset for the English language and reproduced the performance of previous work on the new dataset. The three datasets contained 320, 484, and 315 wn concepts that were manually mapped to their corresponding concepts in wp, wt, and ow, respectively.
Recall from Section 2 that our resource alignment technique has two parameters: the similarity threshold and the combination parameter , both defined in [0, 1]. We performed experiments with three different configurations:
Unsupervised, where the two parameters are set to their middle values (i.e., 0.5), hence, no tuning is performed for either of the parameters. In this case, both the definitional and structural similarity scores are treated as equally important and two concepts are aligned if their overall similarity exceeds the middle point of the similarity scale.
Tuning, where we follow Matuschek and Gurevych (2013) and tune the parameters on a subset of the dataset comprising 100 items.
We show in Table 3 the alignment performance of different systems on the task of aligning wn-wp, wn-wt, and wn-ow in terms of Precision (P), Recall (R), F1, and Accuracy. The sb system corresponds to the state-of-the-art definition similarity approaches for wn-wp [27], wn-wt [20], and wn-ow [9]. dwsa stands for Dijkstra-WSA, the state-of-the-art graph-based alignment approach of Matuschek and Gurevych (2013). The authors also provided results for SB+Dijkstra-WSA, a hybrid system where dwsa was tuned for high precision and, in the case when no alignment target could be found, the algorithm fell back on sb judgments. We also show the results for this system as sb+dwsa in the table.
For our approach (SemAlign) we show the results of six different runs each corresponding to a different setting. The first three (middle part of the table) correspond to the results obtained with the three configurations of SemAlign: unsupervised, with tuning on subset, and cross-validation (see Section 4.2.1). In addition to these, we performed experiments where the two parameters of SemAlign were tuned on pair-independent training data, i.e., a training dataset for a pair of resources different from the one being aligned. For this setting, we used the whole dataset of the corresponding resource pair to tune the two parameters of our system. We show the results for this setting in the bottom part of the table (last three lines).
Approach | wn-wp | wn-wt | wn-ow | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
P | R | F1 | A | P | R | F1 | A | P | R | F1 | A | |
Dijkstra-WSA | 0.750 | 0.670 | 0.710 | 0.930 | 0.680 | 0.270 | 0.390 | 0.890 | 0.651 | 0.372 | 0.473 | 0.830 |
SemAlign | 0.877 | 0.788 | 0.830 | 0.959 | 0.604 | 0.643 | 0.623 | 0.907 | 0.654 | 0.602 | 0.627 | 0.853 |
The main feature worth remarking upon is the consistency in the results across different resource pairs: the unsupervised system gains the best recall among the three configurations (with the improvement over sb+dwsa being always statistically significant44All significance tests are done using z-test at p 0.05.) whereas tuning, both on a subset or through cross-validation, consistently leads to the best performance in terms of F1 and accuracy (with the latter being statistically significant with respect to sb+dwsa on wn-wp and wn-wt).
Moreover, the unsupervised system proves to be very robust inasmuch as it provides competitive results on all the three datasets, while it surpasses the performance of sb+dwsa on wn-wt. This is particularly interesting as the latter system involves tuning of several parameters, whereas SemAlign, in its unsupervised configuration, does not need any training data nor does it involve any tuning. In addition, as can be seen in the table, SemAlign benefits from pair-independent training data in most cases across the three resource pairs with performance surpassing that of sb+dwsa, a system which is dependent on pair-specific training data. The consistency in the performance of SemAlign in its different configurations and across different resource pairs indicates its robustness and shows that our system can be utilized effectively for aligning any pair of lexical resources, irrespective of their structure or availability of training data.
The system performance is generally higher on the alignment task for wp compared to wt and ow. We attribute this difference to the dictionary nature of the latter two, where sense distinctions are more fine-grained, as opposed to the relatively concrete concepts in the wp encyclopedia.
We explained in Section 2.1 that our concept similarity measure consists of two components: the definitional and the structural similarities. Measuring the similarity of two concepts in terms of their definitions has been investigated in previous work [27, 12]. The structural similarity component of our approach, however, is novel, but at the same time one of the very few measures which enables the computation of the similarity of concepts across two resources directly and independently of the similarity of their definitions. A comparable approach is the Dijkstra-WSA proposed by Matuschek and Gurevych (2013) which, as also mentioned earlier in the Introduction, first connects the two resources’ graphs by leveraging monosemous linking and then aligns two concepts across the two graphs on the basis of their shortest distance. To gain more insight into the effectiveness of our structural similarity measure in comparison to the Dijkstra-WSA method, we carried out an experiment where our alignment system used only the structural similarity component, a variant of our system we refer to as SemAlign. Both systems (i.e., SemAlign and Dijkstra-WSA) were tuned on 100-item subsets of the corresponding datasets.
We show in Table 4 the performance of the two systems on our three datasets. As can be seen in the table, SemAlign consistently improves over Dijkstra-WSA according to recall, F1 and accuracy with all the differences in recall and accuracy being statistically significant (p 0.05). The improvement is especially noticeable for pairs involving either wt or ow where, thanks to the relatively denser semantic graphs obtained by means of our ontologization technique, the gap in F1 is about 0.23 (wn-wt) and 0.15 (wn-ow).
In addition, as we mentioned earlier, for wn-wp we used the same graph as that of Dijkstra-WSA, since both wn and wp provide a full-fledged semantic network and thus neither needed to be ontologized. Therefore, the considerable performance improvement over Dijkstra-WSA on this resource pair shows the effectiveness of our novel concept similarity measure independently of the underlying semantic network.
Having lexical resources represented as semantic networks is highly beneficial. A good example is WordNet, which has been exploited as a semantic network in dozens of NLP tasks [7]. A recent prominent case is Wikipedia [18, 13] which, thanks to its inter-article hyperlink structure, provides a rich backbone for structuring additional information [2, 34, 23, 8]. However, there are many large-scale resources, such as Wiktionary for instance, which by their very nature are not in the form of a graph. This is usually the case with machine-readable dictionaries, where structuring the resource involves the arduous task of connecting lexicographic senses by means of semantic relations. Surprisingly, despite their vast potential, little research has been conducted on the automatic ontologization of collaboratively-constructed dictionaries like Wiktionary and OmegaWiki. Meyer and Gurevych (2012a) and Matuschek and Gurevych (2013) provided approaches for building graph representations of Wiktionary and OmegaWiki. The resulting graphs, however, were either sparse or had a considerable portion of the nodes left in isolation. Our approach, in contrast, aims at transforming a lexical resource into a full-fledged semantic network, hence providing a denser graph with most of its nodes connected.
Aligning lexical resources has been a very active field of research in the last decade. One of the main objectives in this area has been to enrich existing ontologies by means of complementary information from other resources. As a matter of fact, most efforts have been concentrated on aligning the de facto community standard sense inventory, i.e. WordNet, to other resources. These include: the Roget’s thesaurus and Longman Dictionary of Contemporary English [15], FrameNet [16], VerbNet [33] or domain-specific terminologies such as the Unified Medical Language System [4]. More recently, the growth of collaboratively-constructed resources has seen the development of alignment approaches with Wikipedia [32, 2, 34, 31, 25], Wiktionary [20] and OmegaWiki [9]. Last year Matuschek and Gurevych (2013) proposed Dijkstra-WSA, a graph-based approach relying on shortest paths between two concepts when the two corresponding resources graphs were combined by leveraging monosemous linking. Their method when backed off with other definition similarity based approaches [27, 20], achieved state-of-the-art results on the mapping of WordNet to different collaboratively-constructed resources. This approach, however, in addition to setting the threshold for the definition similarity component by means of cross validation, also required other parameters to be tuned, such as the allowed path length () and the maximum number of edges in a graph. The optimal value for the parameter varied from one resource pair to another, and even for a specific resource pair it had to be tuned for each configuration. This made the approach dependent on the training data for the specific pair of resources that were to be aligned. Instead of measuring the similarity of two concepts on the basis of their distance in the combined graph, our approach models each concept through a rich vectorial representation we refer to as semantic signature and compares the two concepts in terms of the similarity of their semantic signatures. This rich representation leads to our approach having a good degree of robustness such that it can achieve competitive results even in the absence of training data. This enables our system to be applied effectively for aligning new pairs of resources for which no training data is available, with state-of-the-art performance.
This paper presents a unified approach for aligning lexical resources. Our method leverages a novel similarity measure which enables a direct structural comparison of concepts across different lexical resources. Thanks to an effective ontologization method, our alignment approach can be applied to any pair of lexical resources independently of whether they provide a full-fledged network structure. We demonstrate that our approach achieves state-of-the-art performance on aligning WordNet to three collaboratively-constructed resources with different characteristics, i.e., Wikipedia, Wiktionary, and OmegaWiki. We also show that our approach is robust across its different configurations, even when the training data is absent, enabling it to be used effectively for aligning new pairs of lexical resources for which no resource-specific training data is available. In future work, we plan to extend our concept similarity measure across different natural languages. We release all our data at http://lcl.uniroma1.it/semalign.
The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234.
We would like to thank Michael Matuschek for providing us with Wikipedia graphs and alignment datasets.