A Robust Approach to Aligning Heterogeneous Lexical Resources

Mohammad Taher Pilehvar    Roberto Navigli
Department of Computer Science
Sapienza University of Rome
{pilehvar,navigli}@di.uniroma1.it
Abstract

Lexical resource alignment has been an active field of research over the last decade. However, prior methods for aligning lexical resources have been either specific to a particular pair of resources, or heavily dependent on the availability of hand-crafted alignment data for the pair of resources to be aligned. Here we present a unified approach that can be applied to an arbitrary pair of lexical resources, including machine-readable dictionaries with no network structure. Our approach leverages a similarity measure that enables the structural comparison of senses across lexical resources, achieving state-of-the-art performance on the task of aligning WordNet to three different collaborative resources: Wikipedia, Wiktionary and OmegaWiki.

1 Introduction

Lexical resources are repositories of machine-readable knowledge that can be used in virtually any Natural Language Processing task. Notable examples are WordNet, Wikipedia and, more recently, collaboratively-curated resources such as OmegaWiki and Wiktionary [13]. These resources are heterogeneous in design, structure and content, but at the same time they often provide complementary knowledge which we would like to see integrated. Given the large scale of the resources involved, this integration can only be achieved automatically, by means of lexical resource alignment algorithms. Owing to its ability to combine features such as multilinguality and increased coverage, resource alignment has over the past few years proven beneficial to a wide spectrum of tasks, such as Semantic Parsing [33], Semantic Role Labeling [28], and Word Sense Disambiguation [25].

Nevertheless, when it comes to aligning textual definitions in different resources, the lexical approach [32, 5, 11] falls short because the same concept can be defined with totally different wordings. Deeper approaches leverage semantic similarity to go beyond the surface realization of definitions [26, 20, 27]. While providing good results in general, these approaches fail when the definitions of a given word are not of adequate quality and expressiveness to be distinguishable from one another. When a lexical resource can be viewed as a semantic graph, as with WordNet or Wikipedia, this limitation can be overcome by means of alignment algorithms that exploit the network structure to determine the similarity of concept pairs. However, not all lexical resources provide explicit semantic relations between concepts; machine-readable dictionaries like Wiktionary therefore first have to be transformed into semantic graphs before such graph-based approaches can be applied to them. To this end, recent work has proposed graph construction by monosemous linking, where a concept is linked to all the concepts associated with the monosemous words in its definition [17]. However, this alignment method still involves the tuning of parameters that are highly dependent on the characteristics of the generated graphs and, hence, requires hand-crafted sense alignments for the specific pair of resources to be aligned, an effort which has to be replicated every time the resources are updated.

In this paper we propose a unified approach to aligning arbitrary pairs of lexical resources which is independent of their specific structure. Thanks to a novel modeling of the sense entries and an effective ontologization algorithm, our approach also fares well when a resource lacks relational structure or when pair-specific training data is absent, meaning that it is applicable to arbitrary pairs without adaptation. We report state-of-the-art performance when aligning WordNet to Wikipedia, OmegaWiki and Wiktionary.

2 Resource Alignment

Preliminaries.

Our approach for aligning lexical resources exploits the graph structure of each resource. Therefore, we assume that a lexical resource L can be represented as an undirected graph $G=(V,E)$ where $V$ is the set of nodes, i.e., the concepts defined in the resource, and $E$ is the set of undirected edges, i.e., semantic relations between concepts. Each concept $c \in V$ is associated with a set of lexicalizations $\mathcal{L}(c)=\{w_1, w_2, \dots, w_n\}$. For instance, WordNet can be readily represented as an undirected graph $G$ whose nodes are synsets and edges are modeled after the relations between synsets defined in WordNet (e.g., hypernymy, meronymy, etc.), and $\mathcal{L}$ is the mapping between each synset node and the set of synonyms which express the concept. However, other resources such as Wiktionary do not provide semantic relations between concepts and, therefore, have first to be transformed into semantic networks before they can be aligned using our alignment algorithm. We explain in Section 3 how a semi-structured resource which does not exhibit a graph structure can be transformed into a semantic network.
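As a concrete illustration, the following minimal Python sketch (with hypothetical names; the paper does not prescribe any particular implementation) shows how such a resource can be stored as an undirected graph together with its lexicalization mapping $\mathcal{L}$:

from collections import defaultdict

class ResourceGraph:
    """A lexical resource as an undirected semantic graph G = (V, E):
    nodes are concepts, edges are semantic relations, and each concept
    carries its set of lexicalizations (e.g., the synonyms of a synset)."""

    def __init__(self):
        self.adj = defaultdict(set)  # concept -> neighbouring concepts
        self.lex = defaultdict(set)  # concept -> lexicalizations

    def add_concept(self, c, words):
        """Register a concept with its lexicalizations."""
        self.lex[c].update(words)

    def add_relation(self, c1, c2):
        """Add an undirected semantic relation between two concepts."""
        self.adj[c1].add(c2)
        self.adj[c2].add(c1)

    def concepts(self):
        return self.lex.keys()

    def lexicalizations(self, c):
        return self.lex[c]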

Alignment algorithm.

Given a pair of lexical resources L1 and L2, we align each concept in L1 by mapping it to its corresponding concept(s) in the target lexicon L2. Algorithm 1 formalizes the alignment process: the algorithm takes as input the semantic graphs $G_1$ and $G_2$ corresponding to the two resources, as explained above, and produces as output an alignment in the form of a set $A$ of concept pairs. The algorithm iterates over all concepts $c_1 \in V_1$ and, for each of them, obtains the set of concepts $C \subseteq V_2$ which can be considered as alignment candidates for $c_1$ (line 3). For a concept $c_1$, alignment candidates in $G_2$ usually consist of every concept $c_2 \in V_2$ that shares at least one lexicalization with the same part-of-speech tag with $c_1$, i.e., $\mathcal{L}_1(c_1) \cap \mathcal{L}_2(c_2) \neq \emptyset$ [31, 20]. Once the set of target candidates $C$ for a source concept $c_1$ is obtained, the alignment task can be cast as that of identifying those concepts in $C$ to which $c_1$ should be aligned. To do this, the algorithm calculates the similarity between $c_1$ and each $c_2 \in C$ (line 5). If their similarity score exceeds a threshold $\theta$ (line 6), the two concepts $c_1$ and $c_2$ are aligned and the pair $(c_1, c_2)$ is added to $A$ (line 7).

Different resource alignment techniques usually vary in the way they compute the similarity of a pair of concepts across two resources (line 5 in Algorithm 1). In the following, we present our novel approach for measuring the similarity of concept pairs.

Algorithm 1: Lexical Resource Aligner

Input: the graphs H = (V_H, E_H), G_1 = (V_1, E_1) and G_2 = (V_2, E_2), the similarity threshold θ, and the combination parameter β
Output: A, the set of all aligned concept pairs

1:  A ← ∅
2:  for all concepts c_1 ∈ V_1 do
3:      C ← getCandidates(c_1, V_2)
4:      for all concepts c_2 ∈ C do
5:          sim ← calculateSimilarity(H, G_1, G_2, c_1, c_2, β)
6:          if sim > θ then
7:              A ← A ∪ {(c_1, c_2)}
8:  return A
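The following Python sketch is a direct transcription of Algorithm 1, reusing the ResourceGraph interface sketched above; here similarity stands for the concept similarity measure of Section 2.1 (with the combination parameter β folded into it), and lexicalizations are assumed, for simplicity, to be stored as (lemma, part-of-speech) pairs:

def get_candidates(c1, G1, G2):
    """Alignment candidates for c1: concepts of G2 that share at least
    one lexicalization (same lemma and part of speech) with c1 (line 3)."""
    lex1 = G1.lexicalizations(c1)
    return [c2 for c2 in G2.concepts() if lex1 & G2.lexicalizations(c2)]

def align(H, G1, G2, similarity, theta):
    """Algorithm 1: map each concept of G1 to the candidate concepts
    of G2 whose similarity score exceeds the threshold theta."""
    A = set()
    for c1 in G1.concepts():                         # line 2
        for c2 in get_candidates(c1, G1, G2):        # lines 3-4
            if similarity(H, G1, G2, c1, c2) > theta:  # lines 5-6
                A.add((c1, c2))                      # line 7
    return A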

Figure 1: The process of measuring the similarity of a pair of concepts across two resources. The method consists of two components, definitional and structural similarity, each computing a similarity score for the given concept pair. The two scores are combined by means of the parameter β in the final stage.

2.1 Measuring the Similarity of Concepts

Figure 1 illustrates the procedure underlying our cross-resource concept similarity measurement technique. As can be seen, the approach consists of two main components: definitional similarity and structural similarity. Each of these components takes as input a pair of concepts belonging to two different semantic networks and produces a similarity score. These two scores are then combined into an overall score (part (e) of Figure 1) which quantifies the semantic similarity of the two input concepts $c_1$ and $c_2$.

The definitional similarity component computes the similarity of two concepts in terms of the similarity of their definitions, a method that has also been used in previous work for aligning lexical resources [27, 12]. In spite of its simplicity, the mere calculation of the similarity of concept definitions provides a strong baseline, especially for cases where the definitional texts for a pair of concepts to be aligned are lexically similar, yet distinguishable from the other definitions. However, as mentioned in the introduction, definition similarity-based techniques fail at identifying the correct alignments in cases where different wordings are used or definitions are not of high quality. The structural similarity component, instead, is a novel graph-based similarity measurement technique which calculates the similarity between a pair of concepts across the semantic networks of the two resources by leveraging the semantic structure of those networks. This component goes beyond the surface realization of concepts, thus providing a deeper measure of concept similarity.

The two components share the same backbone (parts (b) and (d) of Figure 1), but differ in some stages (parts (a) and (c) in Figure 1). In the following, we explain all the stages involved in the two components (gray blocks in the figure).

2.1.1 Semantic signature generation

The aim of this stage is to model a given concept or set of concepts through a vectorial semantic representation, which we refer to as the semantic signature of the input. We utilized Personalized PageRank (ppr) [10], a random walk graph algorithm, for calculating semantic signatures. The original PageRank (pr) algorithm [3] computes, for a given graph, a single vector wherein each node is associated with a weight denoting its structural importance in that graph. ppr is a variation of pr where the computation is biased towards a set of initial nodes in order to capture the notion of importance with respect to those particular nodes. ppr has previously been used in a wide variety of tasks such as definition similarity-based resource alignment [27], textual semantic similarity [14, 30], Word Sense Disambiguation [1, 6] and semantic text categorization [24]. When applied to a semantic graph with the random walks initialized from a set of concepts (nodes), ppr yields a vector in which each concept is associated with a weight denoting its semantic relevance to the initial concepts.

Formally, we first represent a semantic network consisting of $N$ concepts as a row-stochastic transition matrix $\mathbf{M} \in \mathbb{R}^{N \times N}$. The cell $(i,j)$ of the matrix denotes the probability of moving from concept $i$ to concept $j$ in the graph: 0 if no edge exists from $i$ to $j$, and $1/degree(i)$ otherwise. Then the ppr vector, i.e., the semantic signature $\mathcal{S}_{\mathbf{v}}$ for a vector $\mathbf{v}$, is the unique solution to the linear system $\mathcal{S}_{\mathbf{v}} = (1-\alpha)\,\mathbf{v} + \alpha\,\mathbf{M}\,\mathcal{S}_{\mathbf{v}}$, where $\mathbf{v}$ is the personalization vector of size $N$ in which all the probability mass is put on the concepts for which a semantic signature is to be computed, and $\alpha$ is the damping factor, which is usually set to 0.85 [3]. We used the off-the-shelf ukb implementation of ppr.¹

¹ http://ixa2.si.ehu.es/ukb/
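In our experiments the signatures are computed with the ukb package; purely as an illustration, a minimal power-iteration sketch of the above linear system could look as follows (since $\mathbf{M}$ is defined row-wise, the update multiplies by its transpose):

import numpy as np

def ppr_signature(M, v, alpha=0.85, n_iter=1000, tol=1e-10):
    """Compute a semantic signature by Personalized PageRank.

    M : (N, N) row-stochastic transition matrix; M[i, j] is the
        probability of moving from concept i to concept j.
    v : (N,) personalization vector holding the probability mass of
        the concepts for which the signature is computed.
    Returns the fixed point of S = (1 - alpha) * v + alpha * M^T S,
    found by power iteration.
    """
    s = v.copy()
    for _ in range(n_iter):
        s_new = (1 - alpha) * v + alpha * (M.T @ s)
        if np.abs(s_new - s).sum() < tol:
            return s_new
        s = s_new
    return s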

Definitional similarity signature.

In the definitional similarity component, the two concepts $c_1$ and $c_2$ are first represented by their corresponding definitions $d_1$ and $d_2$ in the respective resources L1 and L2 (Figure 1(a), top). To improve expressiveness, we follow Niemann and Gurevych (2011) and further extend $d_i$ with all the word forms associated with concept $c_i$ and its neighbours, i.e., the union of all lexicalizations $\mathcal{L}_i(x)$ for all concepts $x \in \{c' \in V_i : (c_i, c') \in E_i\} \cup \{c_i\}$, where $E_i$ is the set of edges in $G_i$. In this component the personalization vector $\mathbf{v}_i$ is set by uniformly distributing the probability mass over the nodes corresponding to the senses of all the content words in the extended definition $d_i$, according to the sense inventory of a semantic network $H$. We use the same semantic graph $H$ for computing the semantic signatures of both definitions. Any semantic network with a dense relational structure, providing good coverage of the words appearing in the definitions, is a suitable candidate for $H$. For this purpose we used the WordNet [7] graph,² which was further enriched by connecting each concept to all the concepts appearing in its disambiguated gloss.

² http://wordnet.princeton.edu
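A minimal sketch of this step, assuming a hypothetical function H_senses that maps a word to the indices of its sense nodes in the graph H:

import numpy as np

def definitional_personalization(content_words, H_senses, N):
    """Personalization vector for a definitional signature: probability
    mass is spread uniformly over the nodes of H that are senses of the
    content words of the extended definition.

    content_words : content words of the extended definition d_i
    H_senses      : word -> indices of its sense nodes in H (assumed)
    N             : number of nodes in H
    """
    v = np.zeros(N)
    nodes = {n for w in content_words for n in H_senses(w)}
    if not nodes:  # no content word covered by H
        return v
    for n in nodes:
        v[n] = 1.0 / len(nodes)
    return v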

Structural similarity signature.

In the structural similarity component (Figure 1(b), bottom), the semantic signature for each concept $c_i$ is computed by running the ppr algorithm on its corresponding graph $G_i$; hence a different transition matrix $\mathbf{M}_i$ is built for each of the two concepts.

2.1.2 Signature unification

As mentioned earlier, semantic signatures are vectors whose dimension equals the number of nodes in the semantic graph. Since the structural similarity signatures $\mathcal{S}_{\mathbf{v}_1}$ and $\mathcal{S}_{\mathbf{v}_2}$ are calculated on different graphs and thus have different dimensions, we need to make them comparable by unifying them. We therefore propose an approach (part (c) of Figure 1) that finds a common ground between the two signatures: to this end we consider all the concepts associated with monosemous words in the two signatures as landmarks and restrict the two signatures exclusively to those common concepts. Leveraging monosemous words as bridges between two signatures is a particularly reliable technique, as typically a significant portion of all words in a lexicon are monosemous.³

³ For instance, we calculated that more than 80% of the words in WordNet are monosemous, with over 60% of all the synsets containing at least one of them.

Formally, let $G(w)$ be an inventory mapping function that maps a term $w$ to the set of concepts which are expressed by $w$ in graph $G$. Then, given two signatures $\mathcal{S}_{\mathbf{v}_1}$ and $\mathcal{S}_{\mathbf{v}_2}$, computed on the respective graphs $G_1$ and $G_2$, we first obtain the set of words that are monosemous according to both semantic networks, i.e., $\mathcal{M} = \{w : |G_1(w)| = 1 \wedge |G_2(w)| = 1\}$. We then transform each of the two signatures $\mathcal{S}_{\mathbf{v}_i}$ into a new sub-signature $\mathcal{S}'_{\mathbf{v}_i}$ whose dimension is $|\mathcal{M}|$: the $k$th component of $\mathcal{S}'_{\mathbf{v}_i}$ corresponds to the weight in $\mathcal{S}_{\mathbf{v}_i}$ of the only concept of $w_k \in \mathcal{M}$, i.e., the single element of $G_i(w_k)$. As an example, assume we are given two semantic signatures computed for two concepts in WordNet and Wiktionary. Also, consider the noun tradeoff, which is monosemous according to both these resources. Then each of the two unified sub-signatures will contain a component whose weight is determined by the weight of the only concept associated with $tradeoff_n$ in the corresponding semantic signature. As a result of the unification process, we obtain a pair of equally-sized semantic signatures with comparable components.
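A sketch of the unification step, under the simplifying assumption that the shared vocabulary of the two resources is available as a list and that signatures are stored as concept-to-weight dictionaries:

def unify(sig1, sig2, inv1, inv2, vocabulary):
    """Restrict two signatures, computed on different graphs, to the
    landmark concepts of words that are monosemous in both resources.

    sig1, sig2 : dicts mapping a concept of G1 (resp. G2) to its weight
    inv1, inv2 : inventory functions G1(w), G2(w) returning the set of
                 concepts expressing word w in each graph
    vocabulary : words shared by the two resources (assumed given)
    """
    landmarks = [w for w in vocabulary
                 if len(inv1(w)) == 1 and len(inv2(w)) == 1]
    # the k-th component holds the weight of the single sense of w_k
    sub1 = [sig1.get(next(iter(inv1(w))), 0.0) for w in landmarks]
    sub2 = [sig2.get(next(iter(inv2(w))), 0.0) for w in landmarks]
    return sub1, sub2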

2.1.3 Signature comparison

Having at hand the semantic signatures for the two input concepts, we proceed to comparing them (part (d) in Figure 1). We leverage a non-parametric measure proposed by Pilehvar et al. (2013) which first transforms each signature into a list of sorted elements and then calculates the similarity on the basis of the average ranking of elements across the two lists:

$$\textit{Sim}(\mathcal{S}_{\mathbf{v}_1}, \mathcal{S}_{\mathbf{v}_2}) = \frac{\sum_{i=1}^{|T|} \left(r_i^1 + r_i^2\right)^{-1}}{\sum_{i=1}^{|T|} (2i)^{-1}} \qquad (1)$$

where $T$ is the intersection of all concepts with non-zero probability in the two signatures and $r_i^j$ is the rank of the $i$th entry in the $j$th sorted list. The denominator is a normalization factor that guarantees a maximum value of one. The measure penalizes differences in the higher ranks more heavily than those in the lower ones. It was shown to outperform the conventional cosine measure when comparing semantic signatures in multiple textual similarity tasks [30].
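A direct implementation of Equation 1 on two unified sub-signatures; for simplicity, this sketch computes the ranks within the set $T$ itself:

def rank_similarity(sub1, sub2):
    """Rank-based similarity of two unified signatures (Eq. 1)."""
    # T: components with non-zero weight in both signatures
    T = [k for k in range(len(sub1)) if sub1[k] > 0 and sub2[k] > 0]

    def ranks(sub):
        # rank 1 = highest-weighted element of T in this signature
        order = sorted(T, key=lambda k: sub[k], reverse=True)
        return {k: r + 1 for r, k in enumerate(order)}

    r1, r2 = ranks(sub1), ranks(sub2)
    numerator = sum(1.0 / (r1[k] + r2[k]) for k in T)
    denominator = sum(1.0 / (2 * i) for i in range(1, len(T) + 1))
    return numerator / denominator if denominator else 0.0

When the two rankings are identical, numerator and denominator coincide and the score reaches its maximum of one, as intended by the normalization factor.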

2.1.4 Score combination

Finally (part (e) of Figure 1), we calculate the overall similarity between two concepts as a linear combination of their definitional and structural similarities: $\beta\,\textit{Sim}_{def}(\mathcal{S}_{\mathbf{v}_1}, \mathcal{S}_{\mathbf{v}_2}) + (1-\beta)\,\textit{Sim}_{str}(\mathcal{S}_{\mathbf{v}_1}, \mathcal{S}_{\mathbf{v}_2})$. In Section 4.2.1, we explain how we set, in our experiments, the values of $\beta$ and the similarity threshold $\theta$ (cf. the alignment algorithm in Section 2).
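The combination itself is a one-liner; leaving β at its middle value corresponds to the unsupervised configuration evaluated in Section 4.2.1:

def overall_similarity(sim_def, sim_str, beta=0.5):
    """Combine the definitional and structural scores; beta = 0.5
    corresponds to the untuned, unsupervised setting."""
    return beta * sim_def + (1 - beta) * sim_str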

3 Lexical Resource Ontologization

In Section 2, we presented our approach for aligning lexical resources. However, the approach assumes that the input resources can be viewed as semantic networks, which seems to limit its applicability to structured resources only. In order to address this issue and hence generalize our alignment approach to any given lexical resource, we propose a method for transforming a given machine-readable dictionary into a semantic network, a process we refer to as ontologization.

Our ontologization algorithm takes as input a lexicon L and outputs a semantic graph G=(V,E) where, as already defined in Section 2, V is the set of concepts in L and E is the set of semantic relations between these concepts. Introducing relational links into a lexicon can be achieved in different ways. A first option is to extract binary relations between pairs of words from raw text. Both words in these relations, however, should be disambiguated according to the given lexicon [29], making the task particularly prone to mistakes due to the high number of possible sense pairings.

Here, we take an alternative approach which requires disambiguation on the target side only, hence significantly reducing the size of the search space. We first create the empty undirected graph $G_L=(V,E)$ such that $V$ is the set of concepts in L and $E = \emptyset$. For each source concept $c \in V$ we create a bag of content words $W=\{w_1,\dots,w_n\}$ which includes all the content words in its definition $d$ and, if available, additional related words obtained from lexicon relations (e.g., synonyms in Wiktionary). The problem is then cast as a disambiguation task whose goal is to identify the intended sense of each word $w_i \in W$ according to the sense inventory of L: if $w_i$ is monosemous, i.e., $|G_L(w_i)|=1$, we connect our source concept $c$ to the only sense $c_{w_i}$ of $w_i$ and set $E := E \cup \{\{c, c_{w_i}\}\}$; otherwise, $w_i$ has multiple senses in L. In this latter case, we choose the most appropriate concept $\hat{c}_{w_i} \in G_L(w_i)$ by finding the maximal similarity between the definition of $c$ and the definitions of each sense of $w_i$. To do this, we apply our definitional similarity measure introduced in Section 2.1. Having found the intended sense $\hat{c}_{w_i}$ of $w_i$, we add the edge $\{c, \hat{c}_{w_i}\}$ to $E$. As a result of this procedure, we obtain a semantic graph representation $G$ for the lexicon L.

As an example, consider the 4th sense of the noun cone in Wiktionary (i.e., $cone^4_n$), which is defined as "The fruit of a conifer". The definition contains two content words: $fruit_n$ and $conifer_n$. The latter word is monosemous in Wiktionary, hence we directly connect $cone^4_n$ to the only sense of $conifer_n$. The noun fruit, however, has 5 senses in Wiktionary. We therefore measure the similarity between the definition of $cone^4_n$ and all 5 definitions of fruit and introduce a link from $cone^4_n$ to the sense of fruit which yields the maximal similarity value (defined as "(botany) The seed-bearing part of a plant…").
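A sketch of the whole procedure, assuming a hypothetical dictionary interface (concepts(), content_words(), senses() and definition() are illustrative names) and reusing the definitional similarity measure of Section 2.1 as a black box:

def ontologize(lexicon, definitional_similarity):
    """Build a semantic graph for a machine-readable dictionary by
    disambiguating the content words of every definition (Section 3)."""
    edges = set()
    for c in lexicon.concepts():
        # definition content words plus, if available, related words
        for w in lexicon.content_words(c):
            senses = lexicon.senses(w)
            if len(senses) == 1:
                target = senses[0]  # monosemous: link directly
            else:
                # ambiguous: pick the sense whose definition is most
                # similar to the definition of the source concept c
                target = max(senses, key=lambda s: definitional_similarity(
                    lexicon.definition(c), lexicon.definition(s)))
            if target != c:
                edges.add(frozenset((c, target)))  # undirected edge
    return edges

On the cone example above, conifer would take the monosemous branch, while fruit would take the similarity-based branch.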

4 Experiments

Lexical resources.

To enable a comparison with the state of the art, we followed Matuschek and Gurevych (2013) and performed an alignment of WordNet synsets (wn) to three different collaboratively-constructed resources: Wikipedia (wp), Wiktionary (wt), and OmegaWiki (ow). We utilized the DKPro software [35, 9] to access the information in the foregoing three resources. For wp, wt, ow we used the dump versions 20090822, 20131002, and 20131115, respectively.

Evaluation measures.

We followed previous work [25, 17] and evaluated the alignment performance in terms of four measures: precision, recall, F1, and accuracy. Precision is the fraction of the alignment judgments returned by the system that are correct, while recall is the fraction of the alignment judgments in the gold standard dataset that are correctly returned by the system. F1 is the harmonic mean of precision and recall. We also report accuracy which, in addition to true positives, takes into account true negatives, i.e., pairs which are correctly judged as unaligned.
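In terms of the counts of true/false positive and negative judgments, the four measures reduce to:

def alignment_scores(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from the counts of true/false
    positive and negative alignment judgments."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    acc = (tp + tn) / (tp + fp + fn + tn)
    return p, r, f1, acc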

Lexicons and semantic graphs.

Here, we describe how the four semantic graphs for our four lexical resources (i.e., wn, wp, wt, ow) were constructed. As mentioned in Section 2.1.1, we build the wn graph by including all the synsets and semantic relations defined in WordNet (e.g., hypernymy and meronymy) and further populate the relation set by connecting a synset to all the other synsets that appear in its disambiguated gloss. For wp, we used the graph provided by Matuschek and Gurevych (2013), constructed by directly connecting an article (concept) to all the hyperlinks in its first paragraph, together with the category links. Our wn and wp graphs have 118K and 2.8M nodes, respectively, with the average node degree being roughly 9 in both resources.

The other two resources, i.e., wt and ow, do not provide a reliable network of semantic relations, therefore we used our ontologization approach to construct their corresponding semantic graphs. We report, in the following subsection, the experiments carried out to assess the accuracy of our ontologization method, together with the statistics of the obtained graphs for wt and ow.

4.1 Ontologization Experiments

For ontologizing wt and ow, the bag of content words W is given by the content words in sense definitions and, if available, additional related words obtained from lexicon relations (see Section 3). In wt, both of these are given as surface word forms and hence had to be disambiguated. For ow, however, the encoded relations, though relatively small in number, are already disambiguated; the ontologization therefore only had to be performed on the content words of definitions.

The resulting graphs for wt and ow contain 430K and 48K nodes, respectively, each providing more than 95% coverage of concepts, with the average node degree being around 10 for both resources. We present in Table 1, for wt and ow, the total number of edges together with their distribution across types (i.e., ambiguous and unambiguous) and sources (i.e., definitions and relations) from which candidate words were obtained.

Source       Type          wt       ow
Definition   Ambiguous     76.6%    50.7%
             Unambiguous   18.3%    32.9%
Relation     Ambiguous      2.8%    -
             Unambiguous    2.3%    16.4%
Total number of edges       2.1M    255K
Table 1: The statistics of the generated graphs for wt and ow. We report the distribution of the edges across types (i.e., ambiguous and unambiguous) and sources (i.e., definitions and relations) from which candidate words were obtained.

The edges obtained from unambiguous entries are essentially sense disambiguated on both sides whereas those obtained from ambiguous terms are a result of our similarity-based disambiguation. Hence, given that a large portion of edges came from ambiguous words (see Table 1), we carried out an experiment to evaluate the accuracy of our disambiguation method. To this end, we took as our benchmark the dataset provided by Meyer and Gurevych (2010) for evaluating relation disambiguation in wt. The dataset contains 394 manually-disambiguated relations. We compared our similarity-based disambiguation approach against the state of the art on this dataset, i.e., the wktwsd system, which is a wt relation disambiguation algorithm based on a series of rules [22].

Table 2 shows the performance of our disambiguation method, together with that of wktwsd, in terms of Precision (P), Recall (R), F1, and accuracy. The “Human” row corresponds to the inter-rater F1 and accuracy scores, i.e., the upperbound performance on this dataset, as calculated by Meyer and Gurevych (2010). As can be seen, our method proves to be very accurate, surpassing the performance of the wktwsd system in terms of precision, F1, and accuracy. This is particularly interesting as the wktwsd system uses a rule-based technique specific to relation disambiguation in wt, whereas our method is resource independent and can be applied to arbitrary words in the definition of any concept. We also note that the graph constructed by Meyer and Gurevych (2010) had an average node degree of around 1.

More recently, Matuschek and Gurevych (2013) leveraged monosemous linking (cf. Section 5) in order to create denser semantic graphs for ow and wt. Our approach, however, thanks to the connections obtained through ambiguous words, can provide graphs with significantly higher coverage. As an example, for wt, Matuschek and Gurevych (2013) generated a graph where around 30% of the nodes were in isolation, whereas this number drops to around 5% in our corresponding graph.

These results show that our ontologization approach can be used to obtain dense semantic graph representations of lexical resources, while at the same time preserving a high level of accuracy. Now that all four resources have been transformed into semantic graphs, we move on to our alignment experiments.

Approach     P      R      F1     A
wktwsd       0.780  0.800  0.790  0.840
Our method   0.852  0.767  0.807  0.857
Human        -      -      0.890  0.910
Table 2: The performance of relation disambiguation for our similarity-based disambiguation method, as well as for the wktwsd system.

4.2 Alignment Experiments

4.2.1 Experimental setup

Approach   Training type         wn-wp                      wn-wt                      wn-ow
                                 P     R     F1    A        P     R     F1    A        P     R     F1    A
sb         Cross-val.            0.780 0.780 0.780 0.950    0.670 0.650 0.660 0.910    0.749 0.691 0.716 0.886
dwsa       Tuning on subset      0.750 0.670 0.710 0.930    0.680 0.270 0.390 0.890    0.651 0.372 0.473 0.830
sb+dwsa    Cross-val. + tuning   0.750 0.870 0.810 0.950    0.680 0.710 0.690 0.920    0.794 0.688 0.735 0.898
SemAlign   Unsupervised          0.709 0.929 0.805 0.943    0.642 0.799 0.712 0.923    0.664 0.761 0.709 0.872
           Tuning on subset      0.877 0.792 0.833 0.960    0.672 0.799 0.730 0.930    0.750 0.717 0.733 0.893
           Cross-val.            0.852 0.835 0.840 0.965    0.680 0.769 0.722 0.931    0.778 0.725 0.749 0.900
           Tuning on wn-wp       -     -     -     -        0.754 0.627 0.684 0.931    0.825 0.584 0.684 0.889
           Tuning on wn-wt       0.738 0.934 0.824 0.950    -     -     -     -        0.805 0.677 0.736 0.900
           Tuning on wn-ow       0.744 0.925 0.824 0.950    0.684 0.766 0.723 0.930    -     -     -     -
Table 3: The performance of different systems on the task of aligning WordNet to Wikipedia (wn-wp), Wiktionary (wn-wt), and OmegaWiki (wn-ow) in terms of Precision (P), Recall (R), F1, and Accuracy (A). We present results for different configurations of our system (SemAlign), together with the state of the art in definition similarity-based alignment approaches (sb) and the best configuration of the state-of-the-art graph-based system, Dijkstra-WSA (dwsa) [17].
Datasets.

As our benchmark we tested on the gold standard datasets used in Matuschek and Gurevych (2013) for three alignment tasks: WordNet-Wikipedia (wn-wp), WordNet-Wiktionary (wn-wt), and WordNet-OmegaWiki (wn-ow). However, the dataset for wn-ow was originally built for the German language and, hence, was missing many English ow concepts that could be considered as candidate target alignments. We therefore fixed the dataset for the English language and reproduced the performance of previous work on the new dataset. The three datasets contained 320, 484, and 315 wn concepts that were manually mapped to their corresponding concepts in wp, wt, and ow, respectively.

Configurations.

Recall from Section 2 that our resource alignment technique has two parameters: the similarity threshold θ and the combination parameter β, both defined in [0, 1]. We performed experiments with three different configurations:

  • Unsupervised, where the two parameters are set to their middle values (i.e., 0.5), hence, no tuning is performed for either of the parameters. In this case, both the definitional and structural similarity scores are treated as equally important and two concepts are aligned if their overall similarity exceeds the middle point of the similarity scale.

  • Tuning, where we follow Matuschek and Gurevych (2013) and tune the parameters on a subset of the dataset comprising 100 items.

  • Cross-validation, where a 5-fold cross validation is carried out to find the optimal values for the parameters, a technique used in most of the recent alignment methods [27, 21, 17].

4.2.2 Results

We show in Table 3 the alignment performance of different systems on the task of aligning wn-wp, wn-wt, and wn-ow in terms of Precision (P), Recall (R), F1, and Accuracy. The sb system corresponds to the state-of-the-art definition similarity approaches for wn-wp [27], wn-wt [20], and wn-ow [9]. dwsa stands for Dijkstra-WSA, the state-of-the-art graph-based alignment approach of Matuschek and Gurevych (2013). The authors also provided results for SB+Dijkstra-WSA, a hybrid system where dwsa was tuned for high precision and, in the case when no alignment target could be found, the algorithm fell back on sb judgments. We also show the results for this system as sb+dwsa in the table.

For our approach (SemAlign) we show the results of six different runs each corresponding to a different setting. The first three (middle part of the table) correspond to the results obtained with the three configurations of SemAlign: unsupervised, with tuning on subset, and cross-validation (see Section 4.2.1). In addition to these, we performed experiments where the two parameters of SemAlign were tuned on pair-independent training data, i.e., a training dataset for a pair of resources different from the one being aligned. For this setting, we used the whole dataset of the corresponding resource pair to tune the two parameters of our system. We show the results for this setting in the bottom part of the table (last three lines).

Approach        wn-wp                      wn-wt                      wn-ow
                P     R     F1    A        P     R     F1    A        P     R     F1    A
Dijkstra-WSA    0.750 0.670 0.710 0.930    0.680 0.270 0.390 0.890    0.651 0.372 0.473 0.830
SemAlign_str    0.877 0.788 0.830 0.959    0.604 0.643 0.623 0.907    0.654 0.602 0.627 0.853
Table 4: Performance of SemAlign when using only the structural similarity component (SemAlign_str), compared to the state-of-the-art graph-based alignment approach, Dijkstra-WSA [17], for our three resource pairs: WordNet to Wikipedia (wn-wp), Wiktionary (wn-wt), and OmegaWiki (wn-ow).

The main feature worth remarking upon is the consistency of the results across different resource pairs: the unsupervised system obtains the best recall among the three configurations (with the improvement over sb+dwsa always being statistically significant⁴), whereas tuning, whether on a subset or through cross-validation, consistently leads to the best performance in terms of F1 and accuracy (with the latter being statistically significant with respect to sb+dwsa on wn-wp and wn-wt).

⁴ All significance tests are performed using the z-test at p < 0.05.

Moreover, the unsupervised system proves to be very robust inasmuch as it provides competitive results on all three datasets and surpasses the performance of sb+dwsa on wn-wt. This is particularly interesting as the latter system involves the tuning of several parameters, whereas SemAlign, in its unsupervised configuration, needs no training data and involves no tuning. In addition, as can be seen in the table, SemAlign benefits from pair-independent training data in most cases across the three resource pairs, with performance surpassing that of sb+dwsa, a system which depends on pair-specific training data. The consistency in the performance of SemAlign across its different configurations and across different resource pairs indicates its robustness, and shows that our system can be utilized effectively for aligning any pair of lexical resources, irrespective of their structure or the availability of training data.

The system performance is generally higher on the alignment task for wp compared to wt and ow. We attribute this difference to the dictionary nature of the latter two, where sense distinctions are more fine-grained, as opposed to the relatively concrete concepts in the wp encyclopedia.

4.3 Similarity Measure Analysis

We explained in Section 2.1 that our concept similarity measure consists of two components: definitional and structural similarity. Measuring the similarity of two concepts in terms of their definitions has been investigated in previous work [27, 12]. The structural similarity component of our approach, however, is novel, and at the same time one of the very few measures which enable the computation of the similarity of concepts across two resources directly, independently of the similarity of their definitions. A comparable approach is Dijkstra-WSA, proposed by Matuschek and Gurevych (2013), which, as mentioned in the Introduction, first connects the graphs of the two resources by monosemous linking and then aligns two concepts across the two graphs on the basis of their shortest distance. To gain more insight into the effectiveness of our structural similarity measure in comparison to Dijkstra-WSA, we carried out an experiment in which our alignment system used only the structural similarity component, a variant of our system we refer to as SemAlign_str. Both systems (i.e., SemAlign_str and Dijkstra-WSA) were tuned on 100-item subsets of the corresponding datasets.

We show in Table 4 the performance of the two systems on our three datasets. As can be seen in the table, SemAlign_str consistently improves over Dijkstra-WSA in terms of recall, F1 and accuracy, with all the differences in recall and accuracy being statistically significant (p < 0.05). The improvement is especially noticeable for pairs involving either wt or ow where, thanks to the relatively denser semantic graphs obtained by means of our ontologization technique, the gap in F1 is about 0.23 (wn-wt) and 0.15 (wn-ow).

In addition, as we mentioned earlier, for wn-wp we used the same graph as that of Dijkstra-WSA, since both wn and wp provide a full-fledged semantic network and thus neither needed to be ontologized. Therefore, the considerable performance improvement over Dijkstra-WSA on this resource pair shows the effectiveness of our novel concept similarity measure independently of the underlying semantic network.

5 Related Work

Resource ontologization.

Having lexical resources represented as semantic networks is highly beneficial. A good example is WordNet, which has been exploited as a semantic network in dozens of NLP tasks [7]. A recent prominent case is Wikipedia [18, 13] which, thanks to its inter-article hyperlink structure, provides a rich backbone for structuring additional information [2, 34, 23, 8]. However, there are many large-scale resources, such as Wiktionary for instance, which by their very nature are not in the form of a graph. This is usually the case with machine-readable dictionaries, where structuring the resource involves the arduous task of connecting lexicographic senses by means of semantic relations. Surprisingly, despite their vast potential, little research has been conducted on the automatic ontologization of collaboratively-constructed dictionaries like Wiktionary and OmegaWiki. Meyer and Gurevych (2012a) and Matuschek and Gurevych (2013) provided approaches for building graph representations of Wiktionary and OmegaWiki. The resulting graphs, however, were either sparse or had a considerable portion of the nodes left in isolation. Our approach, in contrast, aims at transforming a lexical resource into a full-fledged semantic network, hence providing a denser graph with most of its nodes connected.

Resource alignment.

Aligning lexical resources has been a very active field of research over the last decade. One of the main objectives in this area has been to enrich existing ontologies with complementary information from other resources. In fact, most efforts have concentrated on aligning the de facto community standard sense inventory, i.e., WordNet, to other resources. These include the Roget's thesaurus and the Longman Dictionary of Contemporary English [15], FrameNet [16], VerbNet [33], and domain-specific terminologies such as the Unified Medical Language System [4]. More recently, the growth of collaboratively-constructed resources has seen the development of approaches for alignment with Wikipedia [32, 2, 34, 31, 25], Wiktionary [20] and OmegaWiki [9]. Matuschek and Gurevych (2013) proposed Dijkstra-WSA, a graph-based approach relying on shortest paths between two concepts once the graphs of the corresponding resources have been combined by monosemous linking. Their method, when backed off with definition similarity-based approaches [27, 20], achieved state-of-the-art results on the mapping of WordNet to different collaboratively-constructed resources. This approach, however, in addition to setting the threshold for the definition similarity component by means of cross-validation, also required other parameters to be tuned, such as the allowed path length (λ) and the maximum number of edges in a graph. The optimal value of the λ parameter varied from one resource pair to another, and even for a specific resource pair it had to be tuned for each configuration. This made the approach dependent on training data for the specific pair of resources to be aligned. Instead of measuring the similarity of two concepts on the basis of their distance in the combined graph, our approach models each concept through a rich vectorial representation, which we refer to as a semantic signature, and compares two concepts in terms of the similarity of their semantic signatures. This rich representation gives our approach a degree of robustness such that it achieves competitive results even in the absence of training data, enabling it to be applied effectively, with state-of-the-art performance, to new pairs of resources for which no training data is available.

6 Conclusions

This paper presents a unified approach for aligning lexical resources. Our method leverages a novel similarity measure which enables a direct structural comparison of concepts across different lexical resources. Thanks to an effective ontologization method, our alignment approach can be applied to any pair of lexical resources, independently of whether they provide a full-fledged network structure. We demonstrate that our approach achieves state-of-the-art performance on aligning WordNet to three collaboratively-constructed resources with different characteristics, i.e., Wikipedia, Wiktionary, and OmegaWiki. We also show that our approach is robust across its different configurations, even when training data is absent, enabling it to be used effectively for aligning new pairs of lexical resources for which no resource-specific training data is available. In future work, we plan to extend our concept similarity measure across different natural languages. We release all our data at http://lcl.uniroma1.it/semalign.

Acknowledgments

The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234.

We would like to thank Michael Matuschek for providing us with Wikipedia graphs and alignment datasets.

References

  • [1] E. Agirre and A. Soroa (2009) Personalizing PageRank for Word Sense Disambiguation. Athens, Greece, pp. 33–41.
  • [2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z. Ives (2007) DBpedia: a nucleus for a web of open data. Busan, Korea, pp. 722–735.
  • [3] S. Brin and L. Page (1998) The anatomy of a large-scale hypertextual Web search engine. Brisbane, Australia, pp. 107–117.
  • [4] A. Burgun and O. Bodenreider (2001) Comparing terms, concepts and semantic classes in WordNet and the Unified Medical Language System. Pittsburgh, USA, pp. 77–82.
  • [5] G. de Melo and G. Weikum (2010) Providing multilingual, multimodal answers to lexical database queries. Valletta, Malta, pp. 348–355.
  • [6] S. Faralli and R. Navigli (2012) A new minimally-supervised framework for domain Word Sense Disambiguation. Jeju, Korea, pp. 1411–1422.
  • [7] C. Fellbaum (Ed.) (1998) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
  • [8] T. Flati, D. Vannella, T. Pasini and R. Navigli (2014) Two is bigger (and better) than one: the Wikipedia Bitaxonomy Project. Baltimore, Maryland.
  • [9] I. Gurevych, J. Eckle-Kohler, S. Hartmann, M. Matuschek, C. M. Meyer and C. Wirth (2012) UBY: a large-scale unified lexical-semantic resource based on LMF. Avignon, France, pp. 580–590.
  • [10] T. H. Haveliwala (2002) Topic-sensitive PageRank. Hawaii, USA, pp. 517–526.
  • [11] V. Henrich, E. Hinrichs and T. Vodolazova (2011) Semi-automatic extension of GermaNet with sense definitions from Wiktionary. Poznań, Poland, pp. 126–130.
  • [12] V. Henrich, E. W. Hinrichs and K. Suttner (2012) Automatically linking GermaNet to Wikipedia for harvesting corpus examples for GermaNet senses. Journal for Language Technology and Computational Linguistics (JLCL) 27 (1), pp. 1–19.
  • [13] E. H. Hovy, R. Navigli and S. P. Ponzetto (2013) Collaboratively built semi-structured content and Artificial Intelligence: the story so far. Artificial Intelligence 194, pp. 2–27.
  • [14] T. Hughes and D. Ramage (2007) Lexical semantic relatedness with random graph walks. Prague, Czech Republic, pp. 581–589.
  • [15] O. Y. Kwong (1998) Aligning WordNet with additional lexical resources. Montreal, Canada, pp. 73–79.
  • [16] E. Laparra and G. Rigau (2009) Integrating WordNet and FrameNet using a knowledge-based Word Sense Disambiguation algorithm. Borovets, Bulgaria, pp. 1–6.
  • [17] M. Matuschek and I. Gurevych (2013) Dijkstra-WSA: a graph-based approach to word sense alignment. Transactions of the Association for Computational Linguistics (TACL) 1, pp. 151–164.
  • [18] O. Medelyan, D. Milne, C. Legg and I. H. Witten (2009) Mining meaning from Wikipedia. International Journal of Human-Computer Studies 67 (9), pp. 716–754.
  • [19] C. M. Meyer and I. Gurevych (2010) "Worth its weight in gold or yet another resource": a comparative study of Wiktionary, OpenThesaurus and GermaNet. CICLing'10, Iasi, Romania, pp. 38–49.
  • [20] C. M. Meyer and I. Gurevych (2011) What psycholinguists know about chemistry: aligning Wiktionary and WordNet for increased domain coverage. Chiang Mai, Thailand, pp. 883–892.
  • [21] C. M. Meyer and I. Gurevych (2012) OntoWiktionary: constructing an ontology from the collaborative online dictionary Wiktionary. In Semi-Automatic Ontology Development: Processes and Resources, pp. 131–161.
  • [22] C. M. Meyer and I. Gurevych (2012) To exhibit is not to loiter: a multilingual, sense-disambiguated Wiktionary for measuring verb similarity. Mumbai, India, pp. 1763–1780.
  • [23] A. Moro and R. Navigli (2013) Integrating syntactic and semantic analysis into the Open Information Extraction paradigm. Beijing, China, pp. 2148–2154.
  • [24] R. Navigli, S. Faralli, A. Soroa, O. de Lacalle and E. Agirre (2011) Two birds with one stone: learning semantic models for text categorization and Word Sense Disambiguation. Glasgow, UK, pp. 2317–2320.
  • [25] R. Navigli and S. P. Ponzetto (2012) BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193, pp. 217–250.
  • [26] R. Navigli (2006) Meaningful clustering of senses helps boost word sense disambiguation performance. Sydney, Australia, pp. 105–112.
  • [27] E. Niemann and I. Gurevych (2011) The people's web meets linguistic knowledge: automatic sense alignment of Wikipedia and WordNet. Oxford, United Kingdom, pp. 205–214.
  • [28] M. Palmer, D. Gildea and N. Xue (2010) Semantic Role Labeling. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers.
  • [29] P. Pantel and M. Pennacchiotti (2008) Automatically harvesting and ontologizing semantic relations. Amsterdam, The Netherlands, pp. 171–195.
  • [30] M. T. Pilehvar, D. Jurgens and R. Navigli (2013) Align, disambiguate and walk: a unified approach for measuring semantic similarity. Sofia, Bulgaria, pp. 1341–1351.
  • [31] N. Reiter, M. Hartung and A. Frank (2008) A resource-poor approach for linking ontology classes to Wikipedia articles. In J. Bos and R. Delmonte (Eds.), Semantics in Text Processing, Research in Computational Semantics, Vol. 1, pp. 381–387.
  • [32] M. Ruiz-Casado, E. Alfonseca and P. Castells (2005) Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets. Lodz, Poland, pp. 380–386.
  • [33] L. Shi and R. Mihalcea (2005) Putting pieces together: combining FrameNet, VerbNet and WordNet for robust semantic parsing. Mexico City, Mexico, pp. 100–111.
  • [34] F. M. Suchanek, G. Kasneci and G. Weikum (2008) YAGO: a large ontology from Wikipedia and WordNet. Journal of Web Semantics 6 (3), pp. 203–217.
  • [35] T. Zesch, C. Müller and I. Gurevych (2008) Extracting lexical semantic knowledge from Wikipedia and Wiktionary. Marrakech, Morocco.