We show that asymmetric models based on Tversky [19] improve correlations with human similarity judgments and nearest neighbor discovery for both frequent and middle-rank words. In accord with Tversky’s discovery that asymmetric similarity judgments arise when comparing sparse and rich representations, improvement on our two tasks can be traced to heavily weighting the feature bias toward the rarer word when comparing high- and mid-frequency words.
A key assumption of most models of similarity is that a similarity relation is symmetric. This assumption is foundational for some conceptions, such as the idea of a similarity space, in which similarity is the inverse of distance; and it is deeply embedded in many of the algorithms that build on a similarity relation among objects, such as clustering algorithms. The symmetry assumption is not, however, universal, and it is not essential to all applications of similarity, especially when it comes to modeling human similarity judgments. Citing a number of empirical studies, Tversky [19] calls symmetry directly into question, and proposes two general models that abandon symmetry. The one most directly related to a large body of word similarity work that followed is what he calls the ratio model, which he defines as:
\mathrm{sim}(a,b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha\, f(A \setminus B) + \beta\, f(B \setminus A)}    (1)
Here A and B represent feature sets for the objects a and b respectively; the term in the numerator is a function of the set of shared features, a measure of similarity, and the last two terms in the denominator measure dissimilarity: α and β are real-number weights; when α ≠ β, symmetry is abandoned.
To motivate such a measure, Tversky presents experimental data with asymmetric similarity results, including similarity comparisons of countries, line drawings of faces, and letters. Tversky shows that many similarity judgment tasks have an inherent asymmetry; but he also argues, following Rosch [17], that certain kinds of stimuli are more naturally used as foci or standards than others. Goldstone [8] summarizes the results succinctly: “Asymmetrical similarity occurs when an object with many features is judged as less similar to a sparser object than vice versa; for example, North Korea is judged to be more like China than China is [like] North Korea.” Thus, one source of asymmetry is the comparison of sparse and dense representations.
The relevance of such considerations to word similarity becomes clear when we consider that for many applications, word similarity measures need to be well-defined when comparing very frequent words with infrequent words. To make this concrete, let us consider a word representation in the word-as-vector paradigm [11, 13], using a dependency-based model. Suppose we want to measure the semantic similarity of boat, rank 682 among the nouns in the BNC corpus studied below, which has 1057 nonzero dependency features based on 50 million words of data, with dinghy, rank 6200, which has only 113 nonzero features. At the level of the vector representations we are using, these are events of very different dimensionality; that is, there are ten times as many features in the representation of boat as there are in the representation of dinghy. If, in Tversky/Rosch terms, the more frequent word is also a more likely focus, then this is exactly the kind of situation in which asymmetric similarity judgments will arise. Below we show that an asymmetric measure, with α and β chosen to bias the measure in favor of the less frequent word, greatly improves the performance of a dependency-based vector model in capturing human similarity judgments.
Before presenting these results, it will be helpful to slightly reformulate and slightly generalize Tversky’s ratio model. The reformulation will allow us to directly draw the connection between the ratio model and a set of similarity measures that have played key roles in the similarity literature. First, since Tversky primarily has an additive f in mind (so that, for example, f(A \setminus B) = f(A) − f(A ∩ B)), we can reformulate (1) as follows:
\mathrm{sim}(a,b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha\,[f(A) - f(A \cap B)] + \beta\,[f(B) - f(A \cap B)]}    (2)
Next, since we are interested in generalizing from sets of features to real-valued vectors of features, w_a and w_b, we define
\sigma_{si}(w_a, w_b) = \sum_{f \in A \cap B} si(w_a[f], w_b[f])    (3)
Here si is some numerical operation on real-number feature values (si stands for shared information). If the operation is min and w_a and w_b both contain the feature weights for a and b, then

\sigma_{\min}(w_a, w_b) = \sum_{f \in A \cap B} \min(w_a[f], w_b[f]) = f(A \cap B),

so with si set to min, Equation (3) includes Equation (2) as a special case. Similarly, \sigma_{\min}(w_a, w_a) represents the summed feature weights of a, and therefore

\sigma_{\min}(w_a, w_a) = \sum_{f \in A} w_a[f] = f(A).
In this generalized form, then, (1) becomes
\mathrm{sim}(a,b) = \frac{\sigma_{si}(w_a, w_b)}{\sigma_{si}(w_a, w_b) + \alpha\,[\sigma_{si}(w_a, w_a) - \sigma_{si}(w_a, w_b)] + \beta\,[\sigma_{si}(w_b, w_b) - \sigma_{si}(w_a, w_b)]}    (4)
Thus, if α + β = 1, Tversky’s ratio model becomes simply:
\mathrm{sim}(a,b) = \frac{\sigma_{si}(w_a, w_b)}{\alpha\,\sigma_{si}(w_a, w_a) + \beta\,\sigma_{si}(w_b, w_b)}    (5)
The computational advantage of this reformulation is that the core similarity operation σ_si(w_a, w_b) is done on what is generally only a small number of shared features, and the calculations σ_si(w_a, w_a) and σ_si(w_b, w_b) (which we will call self-similarities) can be computed in advance. Note that sim is symmetric if and only if α = β. When α > β, sim is biased in favor of a as the referent; when α < β, sim is biased in favor of b.
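To make the reformulation concrete, here is a minimal sketch (ours, not the paper’s code) of Equations (4) and (5) over sparse vectors stored as Python dicts, with si fixed to the product operation; the self-similarity values can be computed once per word and cached. The function names are illustrative.

```python
# Minimal sketch of the generalized ratio model (Equations 4-5), assuming
# sparse vectors stored as {feature: weight} dicts and si = product.

def sigma_prod(w1, w2):
    """Shared feature sum sigma_si with si = product (Equation 3)."""
    if len(w2) < len(w1):               # iterate over the smaller vector
        w1, w2 = w2, w1
    return sum(v * w2[f] for f, v in w1.items() if f in w2)

def ratio_sim(w_a, w_b, alpha, beta, aa=None, bb=None):
    """Equation (4); when alpha + beta == 1 it reduces to Equation (5).
    aa and bb are optional precomputed self-similarities sigma(w, w)."""
    shared = sigma_prod(w_a, w_b)
    aa = sigma_prod(w_a, w_a) if aa is None else aa
    bb = sigma_prod(w_b, w_b) if bb is None else bb
    denom = shared + alpha * (aa - shared) + beta * (bb - shared)
    return shared / denom if denom else 0.0
```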
Consider four similarity functions that have played important roles in the literature on similarity:
\mathrm{dice\ prod}(w_1, w_2) = \frac{2 \sum_f w_1[f]\, w_2[f]}{\sum_f w_1[f]^2 + \sum_f w_2[f]^2}

\mathrm{dice}^{\dagger}(w_1, w_2) = \frac{2 \sum_f \min(w_1[f], w_2[f])}{\sum_f w_1[f] + \sum_f w_2[f]}

\mathrm{lin}(w_1, w_2) = \frac{\sum_{f \in A \cap B} (w_1[f] + w_2[f])}{\sum_f w_1[f] + \sum_f w_2[f]}

\cos(w_1, w_2) = \frac{\sum_f w_1[f]\, w_2[f]}{\sqrt{\sum_f w_1[f]^2}\,\sqrt{\sum_f w_2[f]^2}}    (6)
The function dice prod is not well known in the word similarity literature, but in the data mining literature it is often just called the Dice coefficient, because it generalizes the set comparison function of Dice [5]. Observe that cosine is a special case of dice prod (dice prod applied to unit-length vectors is their cosine). dice† was introduced in Curran [3] and was the most successful function in his evaluation. Since lin was introduced in Lin [13], several different functions have borne that name; the version used here is the one used in Curran [3].
The three distinct functions in Equation 6 have a similar form. In fact, all can be defined in terms of functions differing only in their si operation.
Let σ_si(w_1, w_2) be a shared feature sum for operation si, as defined in Equation (3). We define the Tversky-normalized version of σ_si, written T_si, as:

T_{si}(w_1, w_2) = \frac{2\,\sigma_{si}(w_1, w_2)}{\sigma_{si}(w_1, w_1) + \sigma_{si}(w_2, w_2)}    (7)

Footnote 1: Paralleling (7) is Jaccard-family normalization:

J_{si}(w_1, w_2) = \frac{\sigma_{si}(w_1, w_2)}{\sigma_{si}(w_1, w_1) + \sigma_{si}(w_2, w_2) - \sigma_{si}(w_1, w_2)}

It is easy to generalize the result from van Rijsbergen [20] for the original set-specific versions of Dice and Jaccard, and show that all of the Tversky-family functions discussed above are monotonic in Jaccard.
Note that T_si is just the special case of Tversky’s ratio model (5) in which α = β = 0.5, and the similarity measure is symmetric.
We define three si operations, prod, min, and avg, as follows (footnote 2: σ_prod, of course, is the dot product):

\mathrm{prod}(x, y) = x \cdot y \qquad \min(x, y) \qquad \mathrm{avg}(x, y) = \frac{x + y}{2}
This yields the three similarity functions cited above:
\mathrm{dice\ prod}(w_1, w_2) = T_{\mathrm{prod}}(w_1, w_2) \qquad \mathrm{dice}^{\dagger}(w_1, w_2) = T_{\min}(w_1, w_2) \qquad \mathrm{lin}(w_1, w_2) = T_{\mathrm{avg}}(w_1, w_2)    (8)
Thus, all three of these functions are special cases of symmetric ratio models. Below, we investigate asymmetric versions of all three, which we write as T_{si,α}, defined as:
T_{si,\alpha}(w_1, w_2) = \frac{\sigma_{si}(w_1, w_2)}{\alpha\,\sigma_{si}(w_1, w_1) + (1 - \alpha)\,\sigma_{si}(w_2, w_2)}    (9)
Following Lee [11], who investigates a different family of asymmetric similarity functions, we will refer to these as α-skewed measures.
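A sketch, under the same sparse-dict assumptions as the earlier snippet, of the three si operations and the α-skewed normalization of Equation (9); setting alpha = 0.5 recovers the symmetric T_si of Equation (7), and hence dice prod, dice†, and lin of Equation (8). Names are illustrative.

```python
# Sketch of Equations (7)-(9): the si operations and the alpha-skewed
# Tversky normalization over sparse {feature: weight} dict vectors.
SI_OPS = {
    "prod": lambda x, y: x * y,
    "min":  min,
    "avg":  lambda x, y: (x + y) / 2.0,
}

def sigma(si, w1, w2):
    """Shared feature sum sigma_si (Equation 3)."""
    op = SI_OPS[si]
    return sum(op(v, w2[f]) for f, v in w1.items() if f in w2)

def t_skew(si, w1, w2, alpha=0.5):
    """Equation (9): T_{si,alpha}. alpha = 0.5 gives the symmetric T_si of
    Equation (7), i.e. dice prod (prod), dice-dagger (min), lin (avg)."""
    denom = alpha * sigma(si, w1, w1) + (1.0 - alpha) * sigma(si, w2, w2)
    return sigma(si, w1, w2) / denom if denom else 0.0
```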
We also will look at a rank-biased family of measures:
R_{si,\alpha}(w_1, w_2) = \begin{cases} T_{si,\alpha}(w_1, w_2) & \text{if word 1 is less frequent than word 2} \\ T_{si,\alpha}(w_2, w_1) & \text{otherwise} \end{cases}    (10)
Here, T_{si,α} is as defined in (9), and the α-weighted word is always the less frequent word. For example, consider comparing the 100-feature vector for dinghy to the 1000-feature vector for boat: if α is high, we give more weight to the proportion of dinghy’s features that are shared than we give to the proportion of boat’s features that are shared.
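Continuing the sketch, the rank-biased measure of Equation (10) simply routes the α weight to whichever word is rarer; here `rank` is an assumed mapping from a word to its frequency rank (1 = most frequent), and `t_skew` is the function from the previous snippet.

```python
def rank_biased_sim(si, word1, word2, vectors, rank, alpha=0.97):
    """Equation (10): the alpha-weighted argument is always the less
    frequent word. `vectors` maps words to sparse dicts; `rank` maps
    words to frequency rank (1 = most frequent). Both are assumed inputs."""
    # A larger rank number means a rarer word; it takes the alpha slot.
    rare, common = (word1, word2) if rank[word1] >= rank[word2] else (word2, word1)
    return t_skew(si, vectors[rare], vectors[common], alpha=alpha)
```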
In the following sections we present data showing that the performance of a dependency-based similarity system in capturing human similarity judgments can be greatly improved with rank bias and α-skewing. We will investigate the three asymmetric functions defined above (footnote 3: Interestingly, Equation (9) does not yield an asymmetric version of cosine. Plugging unit vectors into the α-skewed version of dice prod still leaves us with a symmetric function, cos, whatever the value of α). We argue that the advantages of rank bias are tied to improved similarity estimation when comparing vectors of very different dimensionality. We then turn to the problem of finding a word’s nearest semantic neighbors. The nearest neighbor problem is a rather natural ground in which to try out ideas on asymmetry, since the nearest neighbor relation is itself not symmetrical. We show that α-skewing can be used to improve the quality of nearest neighbors found for both high- and mid-frequency words.
For each of the 3 rank-biased similarity systems (R_{si,α} for si ∈ {prod, min, avg}) and cosine, we computed correlations with human judgments for the pairs in 2 standard word sets: the combined Miller-Charles/Rubenstein-Goodenough word sets [15, 18] and the Wordsim 353 word set [6], as well as a subset of the Wordsim set restricted to reflect semantic similarity judgments, which we will refer to as Wordsim 201.
For each of the 3 α-skewed similarity systems (T_{si,α}) and cosine, we found the nearest neighbor from among BNC nouns (of any rank) for the 10,000 most frequent BNC nouns, using the dependency DB created in step 2.
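A sketch of the two evaluation steps just described, reusing `t_skew` from the earlier snippet: Spearman correlation against human ratings for a list of scored pairs, and nearest-neighbor search in which the target word always occupies the α-weighted slot. The container and function names are illustrative.

```python
from scipy.stats import spearmanr

def evaluate_correlation(score_fn, pairs, gold):
    """Spearman's rho between system scores and human ratings.
    `pairs` is a list of (word1, word2); `gold` is the parallel ratings."""
    system = [score_fn(w1, w2) for w1, w2 in pairs]
    rho, _p = spearmanr(system, gold)
    return rho

def nearest_neighbor(si, target, candidates, vectors, alpha):
    """Best candidate under T_{si,alpha}(target, candidate); the target
    word always receives the weight alpha."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        if cand == target:
            continue
        score = t_skew(si, vectors[target], vectors[cand], alpha=alpha)
        if score > best_score:
            best, best_score = cand, score
    return best
```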
Table 1 presents the Spearman’s correlation with human judgments for cosine, UKB, and our 3 α-skewed models using MaltParser-based vectors, applied to the combined Miller-Charles/Rubenstein-Goodenough word sets, the Wordsim 353 word set, and the Wordsim 201 word set.
| | | MC/RG α=.5 | MC/RG α=.97 | Wdsm201 α=.5 | Wdsm201 α=.97 | Wdsm353 α=.5 | Wdsm353 α=.97 |
|---|---|---|---|---|---|---|---|
| Dice | dice prod | .59 | .71 | .50 | .60 | .35 | .44 |
| | lin | .48 | .62 | .42 | .54 | .29 | .39 |
| | dice† | .58 | .67 | .49 | .58 | .34 | .43 |
| Euc | Cosine | .65 | NA | .56 | NA | .41 | NA |
| WN | UKB WN | .80 | NA | .75 | NA | .68 | NA |
The first of each of the column pairs is a symmetric system, and the second a rank-biased variant, based on Equation (10). In all cases, the biased system improves on the performance of its symmetric counterpart; in the case of dice† and dice prod, that improvement is enough for the biased system to outperform cosine, the best of the symmetric distributionally based systems. The value .97 was chosen for α because it produced the best α-system on the MC/RG corpus. That value is probably an overtrained optimum. The point is that α-skewing always helps: for all three systems, the improvement obtained by raising α from .5 to the optimum is monotonic. This is shown in Figure 1. Table 2 shows very similar results using the Stanford parser, demonstrating that the pattern is not limited to a single parsing model.
| | MC/RG α=.5 | MC/RG opt | MC/RG opt α | Wdsm201 α=.5 | Wdsm201 opt | Wdsm201 opt α | Wdsm353 α=.5 | Wdsm353 opt | Wdsm353 opt α |
|---|---|---|---|---|---|---|---|---|---|
| dice prod | .65 | .70 | .86 | .42 | .57 | .99 | .36 | .44 | .98 |
| lin | .58 | .68 | .90 | .41 | .56 | .94 | .30 | .41 | .99 |
| dice† | .60 | .71 | .91 | .43 | .53 | .99 | .32 | .43 | .99 |
In Table 3, we list the pairs whose reranking on the MC/RG dataset contributed most to the improvement of the rank-biased (α = .97) system over the default symmetric system. The last column gives an approximation of the amount of correlation improvement provided by that pair (Δρ).

Footnote 4: The approximation is based on the formula for computing Spearman’s ρ with no ties. If n is the number of items and d_i is the difference between the ranks that the system ordering and the human ordering assign to item i, then ρ = 1 − \frac{6 \sum_i d_i^2}{n(n^2 - 1)}, so the improvement on item i is

\Delta\rho_i = \frac{6\,(d_{i,\mathrm{sym}}^2 - d_{i,\alpha}^2)}{n(n^2 - 1)}
| Word 1 | Rank | Word 2 | Rank | Δρ |
|---|---|---|---|---|
| automobile | 7411 | car | 100 | 0.030 |
| asylum | 3540 | madhouse | 14703 | 0.020 |
| coast | 708 | hill | 949 | 0.018 |
| mound | 3089 | stove | 2885 | 0.017 |
| autograph | 10136 | signature | 2743 | 0.009 |
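For concreteness, the per-pair contribution in the last column can be computed with a one-line helper based on the no-ties Spearman formula from the footnote; the argument names are ours.

```python
def delta_rho(d_sym, d_skew, n):
    """Approximate improvement contributed by one pair (footnote 4):
    d_sym and d_skew are the pair's rank differences from the human
    ordering under the symmetric and skewed systems; n is the number
    of pairs in the word set."""
    return 6.0 * (d_sym ** 2 - d_skew ** 2) / (n * (n ** 2 - 1))
```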
Note that 3 of the 5 items contributing the most improvement to this system were pairs with a large difference in rank. Choosing α = .97 weights recall toward the rarer word. We conjecture that the reason this helps is Tversky’s principle: it is natural to use the sparser representation as the focus in the comparison.
Figure 2 gives the results of our nearest neighbor study on the BNC for the case of dice prod. The graphs for the other two α-skewed systems are nearly identical, and are not shown due to space limitations. The target word, the word whose nearest neighbor is being found, always receives the weight α. The x-axis shows target word rank; the y-axis shows the average UKB similarity scores assigned to nearest neighbors every 50 ranks. All the systems show degraded nearest neighbor quality as target words grow rare, but at lower ranks, the system skewed toward the target word fares considerably better than the symmetric system; the line across the bottom tracks the score of a system with randomly generated nearest neighbors. The symmetric dice prod system is an excellent nearest neighbor system at high ranks but drops below the target-skewed system at around rank 3500. We see that the system skewed toward the candidate words is even better than the symmetric system at high ranks, but degrades much more quickly.
We explain these results on the basis of the principle developed for the human correlation data: to reflect natural judgments of similarity for comparisons of representations of differing sparseness, the α weight should be tipped toward the sparser representation.
Thus, a low α works best for high-rank target words, because most nearest neighbor candidates are less frequent, and a low α tips the balance toward the nontarget words. On the other hand, when the target word is a low-ranking word, a high weight on the nontarget words means the rare target never receives the highest weight, and this is disastrous, since most good candidates are higher ranking. Conversely, a high α works better.
The debt owed to Tversky [19] has been made clear in the introduction. Less clear is the debt owed to Jimenez et al. [9], which also proposes an asymmetric similarity framework based on Tversky’s insights. Jimenez et al. showed the continued relevance of Tversky’s work.
Motivated by the problem of measuring how well the distribution of one word captures the distribution of another, Weeds and Weir [21] also explore asymmetric models, expressing similarity calculations as weighted combinations of several variants of what they call precision and recall. Some of their models are also Tverskyan ratio models. To see this, we divide (9) everywhere by σ_si(w_1, w_2):

T_{si,\alpha}(w_1, w_2) = \frac{1}{\alpha \frac{\sigma_{si}(w_1, w_1)}{\sigma_{si}(w_1, w_2)} + (1 - \alpha) \frac{\sigma_{si}(w_2, w_2)}{\sigma_{si}(w_1, w_2)}}

If the si is min, then the two terms in the denominator are the inverses of what W&W call difference-weighted precision and recall:

\mathrm{prec}(w_1, w_2) = \frac{\sigma_{\min}(w_1, w_2)}{\sigma_{\min}(w_1, w_1)} \qquad \mathrm{rec}(w_1, w_2) = \frac{\sigma_{\min}(w_1, w_2)}{\sigma_{\min}(w_2, w_2)}

So for si = min, (9) can be rewritten:

T_{\min,\alpha}(w_1, w_2) = \frac{1}{\frac{\alpha}{\mathrm{prec}(w_1, w_2)} + \frac{1 - \alpha}{\mathrm{rec}(w_1, w_2)}}

That is, T_{min,α} is a weighted harmonic mean of precision and recall, the so-called weighted F-measure [14]. W&W’s additive precision/recall models appear not to be Tversky models, since they compute separate sums for precision and recall from the shared features, one using the weights in w_1 and one using the weights in w_2.
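A small numerical check of this rewriting, reusing `sigma` and `t_skew` from the earlier sketch: for si = min, the α-skewed measure equals the weighted harmonic mean of the precision and recall defined above (assuming the two vectors share at least one feature).

```python
def prec_rec_min(w1, w2):
    """Difference-weighted precision and recall for si = min."""
    shared = sigma("min", w1, w2)
    return shared / sigma("min", w1, w1), shared / sigma("min", w2, w2)

def weighted_f_measure(w1, w2, alpha):
    """Weighted harmonic mean of precision and recall; for vectors with
    shared features this equals t_skew('min', w1, w2, alpha)."""
    p, r = prec_rec_min(w1, w2)
    return 1.0 / (alpha / p + (1.0 - alpha) / r)
```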
Long before Weeds and Weir, Lee [12] proposed an asymmetric similarity measure as well. Like Weeds and Weir, her perspective was to calculate the effectiveness of using one distribution as a proxy for the other, a fundamentally asymmetric problem. For distributions q and r, Lee’s α-skew divergence takes the KL-divergence of r from a mixture of q and r, using the α parameter to define the proportions in the mixture.
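For comparison, a sketch of Lee’s α-skew divergence as it is commonly formulated, KL(r ‖ αq + (1−α)r), over distributions represented as aligned probability arrays; this is our reading of the measure, not code from Lee’s paper.

```python
import numpy as np

def alpha_skew_divergence(q, r, alpha=0.99):
    """alpha-skew divergence: KL(r || alpha*q + (1 - alpha)*r) for aligned
    probability arrays q and r (alpha < 1 keeps the mixture nonzero
    wherever r is nonzero)."""
    q, r = np.asarray(q, dtype=float), np.asarray(r, dtype=float)
    mix = alpha * q + (1.0 - alpha) * r
    mask = r > 0                      # terms with r[i] == 0 contribute 0
    return float(np.sum(r[mask] * np.log(r[mask] / mix[mask])))
```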
We have shown that Tversky’s asymmetric ratio models can improve performance in capturing human judgments and produce better nearest neighbors. To validate these very preliminary results, we need to explore applications compatible with asymmetry, such as the TOEFL-like synonym discovery task in Freitag et al. [7], and the PP-attachment task in Dagan et al. [4].
The work reported here was supported by NSF CDI grant #1028177.