Using distributional analysis methods to compute semantic proximity links between words has become commonplace in NLP. The resulting relations are often noisy or difficult to interpret in general. This paper focuses on the issues of evaluating a distributional resource and filtering the relations it contains, but instead of considering it in abstracto, we focus on pairs of words in context. In a discourse, we are interested in knowing whether the semantic link between two items is a by-product of textual coherence or is irrelevant. We first set up a human annotation of semantic links, with or without contextual information, to show the importance of the textual context in evaluating the relevance of semantic similarity, and to assess the prevalence of actual semantic relations between word tokens. We then built an experiment to automatically predict this relevance, evaluated on the reliable reference data set which was the outcome of the first annotation. We show that in-document information greatly improves the prediction made from the similarity level alone.
The goal of the work presented in this paper is to improve distributional thesauri, and to help evaluate the content of such resources. A distributional thesaurus is a lexical network that lists semantic neighbours, computed from a corpus and a similarity measure between lexical items, which generally captures the similarity of contexts in which the items occur. This way of building a semantic network has been very popular since [], even though the nature of the information it contains is hard to define, and its evaluation is far from obvious. A distributional thesaurus includes a lot of “noise” from a semantic point of view, but also lists relevant lexical pairs that escape classical lexical relations such as synonymy or hypernymy.
There is a classical dichotomy when evaluating NLP components between extrinsic and intrinsic evaluations [], and it applies to distributional thesauri []. Extrinsic evaluations measure the performance of a system that makes use of the resource or component under evaluation, for instance in information retrieval [] or word sense disambiguation []. Intrinsic evaluations try to measure the resource itself against some human standard or judgment, for instance by comparing a distributional resource with an existing synonym dictionary or with similarity judgments produced by human subjects []. The shortcomings of these methods have been underlined in []. Lexical resources designed for other objectives put the spotlight on specific areas of a distributional thesaurus. They are not suitable for evaluating the whole range of semantic relatedness exhibited by distributional similarities, which exceeds the limits of classical lexical relations, even though researchers have tried to collect equivalent resources manually, to be used as a gold standard []. One advantage of distributional similarities is precisely that they exhibit many different semantic relations, not necessarily standard lexical relations. Even with respect to established lexical resources, distributional approaches may improve coverage, complicating the evaluation even more.
The method we propose here has been designed as an intrinsic evaluation with a view to validating semantic proximity links in a broad perspective, covering what [] call “non classical lexical semantic relations”. For instance, agentive relations (author/publish, author/publication) or associative relations (actor/cinema) should be considered. At the same time, we want to filter out associations that can be considered accidental from a semantic perspective (e.g. flag and composer are similar because they both appear frequently with nationality names). We do this by judging the relevance of a lexical relation in a context where both elements of a lexical pair occur. We show not only that this improves the reliability of human judgments, but also that it provides a framework in which this relevance can be predicted automatically. We hypothesize that evaluating and filtering semantic relations in the texts where lexical items occur would help tasks that naturally make use of semantic similarity relations, but assessing this goes beyond the present work.
In the rest of this paper, we describe the resource we used as a case study, and the data we collected to evaluate its content (section 2). We present the experiments we set up to automatically filter semantic relations in context, with various groups of features that take into account information from the corpus used to build the thesaurus and contextual information related to occurrences of semantic neighbours (section 3). Finally, we discuss some related work on the evaluation and improvement of distributional resources (section 4).
We use a distributional resource for French, built on a 200M word corpus extracted from the French Wikipedia, following the principles laid out in [] from a structured model [], i.e. using syntactic contexts. In this approach, contexts are triples (governor, relation, dependent) derived from syntactic dependency structures. Governors and dependents are verbs, adjectives and nouns. Multiword units are available, but they form a very small subset of the resulting neighbours. Base elements in the thesaurus are of two types: arguments (dependents' lemmas) and predicates (governor+relation). This keeps the predicate/argument distinction, since similarities will be computed between predicate pairs or argument pairs, and a lexical item can appear in many predicates and as an argument (e.g. interest as argument, interest_for as one predicate). The similarity of distributions was computed with Lin's score [].
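As an illustration, here is a minimal sketch of how a Lin-style similarity can be computed from such triples, taking the argument view (the dependent is the item, and the governor+relation pair is its context feature). The toy triples, helper names and simplifications are ours; this is not the exact pipeline used to build the resource.

```python
import math
from collections import Counter

# Toy dependency triples (governor, relation, dependent); the real input is
# the parsed 200M-word Wikipedia corpus. All data here is illustrative.
triples = [
    ("drink", "obj", "water"), ("spill", "obj", "water"),
    ("drink", "obj", "coffee"), ("spill", "obj", "coffee"),
    ("read", "obj", "book"), ("write", "obj", "book"),
    ("read", "obj", "letter"), ("write", "obj", "letter"),
]

# Count (argument, feature) pairs, where the feature is (governor, relation).
pair_counts, word_counts, feat_counts = Counter(), Counter(), Counter()
for gov, rel, dep in triples:
    feat = (gov, rel)
    pair_counts[(dep, feat)] += 1
    word_counts[dep] += 1
    feat_counts[feat] += 1
total = sum(pair_counts.values())

def info(word, feat):
    """Positive pointwise mutual information of a (word, feature) pair."""
    if (word, feat) not in pair_counts:
        return 0.0
    p_wf = pair_counts[(word, feat)] / total
    p_w = word_counts[word] / total
    p_f = feat_counts[feat] / total
    return max(0.0, math.log(p_wf / (p_w * p_f)))

def lin_similarity(w1, w2):
    """Lin-style similarity: information of shared features over total information."""
    feats1 = {f for (w, f) in pair_counts if w == w1 and info(w1, f) > 0}
    feats2 = {f for (w, f) in pair_counts if w == w2 and info(w2, f) > 0}
    shared = sum(info(w1, f) + info(w2, f) for f in feats1 & feats2)
    denom = sum(info(w1, f) for f in feats1) + sum(info(w2, f) for f in feats2)
    return shared / denom if denom else 0.0

print(lin_similarity("water", "coffee"))  # 1.0: identical context profiles
print(lin_similarity("water", "book"))    # 0.0: no shared contexts
```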
We will talk of lexical neighbours or distributional neighbours to label pairs of predicates or arguments, and in the rest of the paper we consider only lexical pairs with a Lin score of at least 0.1, which amounts to about 1.4M pairs. This somewhat arbitrary level is an a priori threshold to limit the size of the resulting database, and it is conservative enough not to exclude potentially interesting relations. The distribution of scores is given in figure 1; 97% of the selected pairs have a score between 0.1 and 0.29.
To ease the use of lexical neighbours in our experiments, we merged together, a posteriori, predicates that include the same lexical unit. Thus there is no need for a syntactic analysis of the context considered when exploiting the resource, and sparsity is less of an issue. (Whenever two predicates with the same lemma have common neighbours, we average the scores of the pairs.)
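A minimal sketch of this merging step on toy scores; the predicate naming convention and the choice to merge only the left-hand item are our own simplifications.

```python
from collections import defaultdict

# Toy neighbour pairs from the predicate side of the thesaurus:
# (predicate, neighbour) -> Lin score. Names are illustrative.
scores = {
    ("interest_for", "passion_for"): 0.30,
    ("interest_in",  "passion_for"): 0.20,
    ("interest_for", "curiosity"):   0.15,
}

def lemma(predicate):
    """Assumed convention: the lemma is the part before the first underscore."""
    return predicate.split("_")[0]

# Merge predicates sharing a lemma; average the scores of common neighbours.
merged = defaultdict(list)
for (pred, neigh), score in scores.items():
    merged[(lemma(pred), neigh)].append(score)
merged_scores = {pair: sum(v) / len(v) for pair, v in merged.items()}

print(merged_scores[("interest", "passion_for")])  # (0.30 + 0.20) / 2 = 0.25
print(merged_scores[("interest", "curiosity")])    # 0.15
```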
[…]
Le ventre de l’impala de même que ses lèvres et sa queue sont blancs.
Il faut aussi mentionner leurs lignes noires uniques à chaque individu au bout des oreilles, sur le dos de la queue et sur le front. Ces lignes noires sont très utiles aux impalas puisque ce sont des signes qui leur permettent de se reconnaitre entre eux. Ils possèdent aussi des glandes sécrétant des odeurs sur les pattes arrières et sur le front. Ces odeurs permettent également aux individus de se reconnaitre entre eux. Il a également des coussinets noirs situés, à l’arrière de ses pattes.
Les impalas mâles et femelles ont une morphologie différente. En effet, on peut facilement distinguer un mâle par ses cornes en forme de S qui mesurent de 40 à 90 cm de long. Les impalas vivent dans les savanes où l’herbe (courte ou moyenne) abonde. Bien qu’ils apprécient la proximité d’une source d’eau, celle-ci n’est généralement pas essentielle aux impalas puisqu’ils peuvent se satisfaire de l’eau contenue dans l’herbe qu’ils consomment. Leur environnement est relativement peu accidenté et n’est composé que d’herbes, de buissons ainsi que de quelques arbres. […]
In order to evaluate the resource, we set up an annotation in context: pairs of lexical items are to be judged in their context of use, in texts where they occur together. To verify that this methodology is useful, we ran a preliminary annotation to contrast judgments on lexical pairs with and without this contextual information. We then carried out a larger annotation in context once we were assured of the reliability of the methodology.
For the preliminary test, we asked three annotators to judge the similarity of pairs of lexical items without any context (no-context), and to judge the similarity of pairs presented within a paragraph where they both occur (in context). The three annotators were linguists, and two of them (1 and 3) knew about the resource and how it was built. For each annotation, pairs were randomly selected, with the following constraints:
for the no-context annotation, candidate pairs had a Lin score above , which placed them in the top of lexical neighbours with respect to the similarity level.
for the in context annotation, the only constraint was that the pairs occur in the same paragraph somewhere in the corpus used to build the resource. The example paragraph was chosen at random.
The guidelines given in both cases were the same: “Do you think the two words are semantically close? In other words, is there a semantic relation between them, either classical (synonymy, hypernymy, co-hyponymy, meronymy, co-meronymy) or not (the relation can be paraphrased but does not belong to the previous cases)?”
For the pre-test, agreement was rather moderate without context (the average of pairwise kappas was .46), and much better with a context (average = .68), with agreement rates above 90%. This seems to validate the feasibility of a reliable annotation of relatedness in context, so we went on to a larger annotation with two of the previous annotators.
| Annotators | Non-contextual kappa | Contextual kappa |
|---|---|---|
| N1+N2 | 0.52 | 0.66 |
| N1+N3 | 0.36 | 0.69 |
| N2+N3 | 0.50 | 0.69 |
| Average | 0.46 | 0.68 |
| Experts | NA | 0.80 |
For the larger annotation, the protocol was slightly changed: two annotators were given 42 full texts from the original corpus in which lexical neighbours occurred. They were asked to judge the relation between two item types, regardless of the number of occurrences in the text. This time there was no filtering of the lexical pairs beyond the 0.1 threshold of the original resource. We followed the well-known postulate [] that all occurrences of a word in the same discourse tend to have the same sense (“one sense per discourse”), in order to decrease the annotator workload. We also assumed that the relation between these items remains stable within the document, an arguably strong hypothesis that needed to be checked against inter-annotator agreement before beginning the final annotation. It turns out that the kappa score () shows a better inter-annotator agreement than during the preliminary test, which can be explained by the larger context given to the annotators (the whole text), and thus more occurrences of each element of the pair to judge, and also by the fact that the annotators were more experienced after the preliminary test. Agreement measures are summed up in table 1. An excerpt of an example text, as it was presented to the annotators, is shown in figure 2.
Overall, it took only a few days to annotate 9885 pairs of lexical items. Among the pairs that were presented to the annotators, about 11% were judged as relevant. It is not easy to decide whether the non-relevant pairs are just noise, or context-dependent associations that were not present in the actual text considered (for polysemy reasons for instance), or just low-level associations. An important aspect is thus to verify that there is a correlation between the similarity score (Lin's score here) and the evaluated relevance of the neighbour pairs. The Pearson correlation coefficient shows that the Lin score is indeed significantly correlated with the annotated relevance of lexical pairs, albeit not strongly ().
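This correlation check is straightforward to reproduce; a sketch with toy values standing in for the 9885 annotated pairs (with a binary variable, the Pearson coefficient amounts to a point-biserial correlation):

```python
from scipy.stats import pearsonr

# Toy stand-ins for the annotated pairs: Lin score of each pair and its
# binary relevance judgment (1 = relevant in context, 0 = not).
lin_scores = [0.12, 0.31, 0.10, 0.45, 0.22, 0.11, 0.38, 0.15]
relevance  = [0,    1,    0,    1,    0,    0,    1,    0]

r, p_value = pearsonr(lin_scores, relevance)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```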
The produced annotation (freely available, and distributed with this submission) can be used as a reference to explore various aspects of distributional resources, with the caveat that it is, as such, somewhat dependent on the particular resource used. We nonetheless assume that some of the relevant pairs would appear in other thesauri, or would be of interest in an evaluation of another resource.
The first thing we can analyse from the annotated data is the impact of a threshold on Lin's score for selecting relevant lexical pairs. The resource itself is built by choosing a cut-off which is supposed to keep pairs with a satisfactory similarity, but this threshold is rather arbitrary. Figure 3 shows the influence of the threshold value on the precision and recall of the pairs that are kept, evaluated against the human annotation of relevance in context. If one wants to optimize the F-score (the harmonic mean of precision and recall) when extracting relevant pairs, the optimal point is an F-score of .24, reached with a threshold of .22 on Lin's score. This can be considered as a baseline for the extraction of relevant lexical pairs, to which we turn in the following section.
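A sketch of this baseline threshold sweep, on toy data (with the real annotation, the curve peaks at an F-score of .24 around a .22 threshold):

```python
def precision_recall_f1(threshold, scores, gold):
    """Evaluate 'keep every pair whose Lin score >= threshold' against the
    in-context relevance annotation (gold: 1 = relevant, 0 = not)."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Toy stand-ins for the annotated pairs.
scores = [0.12, 0.31, 0.10, 0.45, 0.22, 0.11, 0.38, 0.15]
gold   = [0,    1,    0,    1,    0,    0,    1,    0]
for t in [0.1, 0.2, 0.3, 0.4]:
    print(t, precision_recall_f1(t, scores, gold))
```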
The outcome of the contextual annotation presented above is a rather sizeable dataset of validated semantic links, and we have shown these linguistic judgments to be reliable. We used this dataset to set up a supervised classification experiment in order to automatically predict the relevance of a semantic link in a given discourse. We now present the list of features that were used by the model. They can be divided into three groups, according to their origin: they are computed from the whole corpus, gathered from the distributional resource, or extracted from the text which contains the semantic pair to be evaluated.
For each pair neighbour₁/neighbour₂, we computed a set of features from Wikipedia (the corpus used to derive the distributional similarity). We first computed the frequencies of each item in the corpus, f₁ and f₂, from which we derive
min(f₁, f₂), max(f₁, f₂): the min and max of f₁ and f₂;
log(f₁ · f₂): the combination of the two, i.e. the log of their product.
We also measured the syntagmatic association of neighbour₁ and neighbour₂, with a mutual information measure [], computed from the cooccurrence of the two tokens within the same paragraph in Wikipedia. This is a rather large window, and thus gives a good coverage with respect to the neighbour database (70% of all pairs).
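A sketch of such a paragraph-level association score, computed here as pointwise mutual information over toy paragraphs; the exact variant used for the resource may differ.

```python
import math
from collections import Counter
from itertools import combinations

# Toy paragraphs (lists of lemmas); the real computation runs over all
# Wikipedia paragraphs used to build the resource.
paragraphs = [
    ["impala", "corne", "savane", "herbe"],
    ["impala", "savane", "eau", "herbe"],
    ["gorille", "pelage", "dos", "fourrure"],
]

n_par = len(paragraphs)
word_df = Counter()   # number of paragraphs containing each word
pair_df = Counter()   # number of paragraphs containing both words of a pair
for par in paragraphs:
    types = set(par)
    word_df.update(types)
    pair_df.update(frozenset(p) for p in combinations(sorted(types), 2))

def paragraph_pmi(w1, w2):
    """Pointwise mutual information of co-occurrence within a paragraph."""
    key = frozenset((w1, w2))
    if pair_df[key] == 0:
        return float("-inf")   # never co-occur within a paragraph
    p_pair = pair_df[key] / n_par
    return math.log(p_pair / ((word_df[w1] / n_par) * (word_df[w2] / n_par)))

print(paragraph_pmi("impala", "savane"))   # > 0: associated
print(paragraph_pmi("impala", "gorille"))  # -inf here: no co-occurrence
```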
A straightforward parameter to include to predict the relevance of a link is of course the similarity measure itself, here Lin’s information measure. But this can be complemented by additional information on the similarity of the neighbours, namely:
each neighbour's productivity: the numbers of neighbours of neighbour₁ and neighbour₂ in the database (that is, related tokens with a similarity above the threshold), from which we derive three features as for frequencies: the min, the max, and the log of the product. The idea is that neighbours with a very high productivity give rise to less reliable relations.
the ranks of each item among the other's neighbours: the rank of neighbour₁ among the neighbours of neighbour₂, ordered with respect to Lin's score, and symmetrically the rank of neighbour₂ among the neighbours of neighbour₁; again we consider as features the min, max and log-product of these ranks.
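A sketch of these productivity and rank features on a toy neighbour database (data and function names are illustrative):

```python
import math

# Toy neighbour lists: for each item, its neighbours sorted by decreasing
# Lin score (only pairs above the 0.1 threshold).
neighbours = {
    "impala": [("gazelle", 0.35), ("antilope", 0.30), ("savane", 0.12)],
    "gazelle": [("antilope", 0.40), ("impala", 0.35)],
}

def productivity(word):
    """Number of neighbours of the word in the database."""
    return len(neighbours.get(word, []))

def rank(word, other):
    """1-based rank of `other` among the neighbours of `word` (by Lin score)."""
    for i, (n, _score) in enumerate(neighbours.get(word, []), start=1):
        if n == other:
            return i
    return None

def min_max_logprod(a, b):
    return min(a, b), max(a, b), math.log(a * b)

w1, w2 = "impala", "gazelle"
print(min_max_logprod(productivity(w1), productivity(w2)))  # (2, 3, log 6)
print(min_max_logprod(rank(w1, w2), rank(w2, w1)))          # (1, 2, log 2)
```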
We add two categorical features of a more linguistic nature:
the pair of parts of speech of the related items, e.g. to distinguish the relevance of NN pairs from that of VV pairs;
the predicate/argument status of the related items: are they predicates or arguments?
| Feature | Description |
|---|---|
|  | Lin's score |
|  | part-of-speech pair of the related items |
|  | predicate or argument |
| tfipf | tfipf(neighbour₁) × tfipf(neighbour₂) |
|  | copresence in a sentence |
|  | copresence in a paragraph |
|  | smallest distance between neighbour₁ and neighbour₂ |
|  | highest distance between neighbour₁ and neighbour₂ |
|  | average distance between neighbour₁ and neighbour₂ |
|  | belong to the same lexical connected component |
The last set of features derives from the occurrences of the related tokens in the discourse under consideration:
First, we take into account the frequencies of the items within the text, with three features as before: the min of the frequencies of the two related items, the max, and the log-product. Then we consider a tfidf [] measure, to evaluate the specificity, and arguably the importance, of a word in a document. Several variants of tfidf have been proposed to adapt the measure to more local areas of a text than the whole document. For instance, [] propose a tfisf (term frequency, inverse sentence frequency) for topic segmentation. We similarly defined a tfipf measure based on the frequency of a word within a paragraph with respect to its frequency within the text. The resulting feature we used is the product of this measure for neighbour₁ and neighbour₂.
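One plausible instantiation of such a tfipf measure, by analogy with tfidf; the exact weighting used in the experiments may differ from this sketch, and the data is a toy document.

```python
import math
from collections import Counter

# A document is a list of paragraphs, each a list of lemmas (toy data).
document = [
    ["impala", "corne", "impala", "savane"],
    ["impala", "herbe", "eau"],
    ["gorille", "pelage", "fourrure"],
]

def tfipf(word, par_index, doc):
    """Frequency of the word in the paragraph, weighted by the log-inverse of
    the proportion of the document's paragraphs that contain it."""
    tf = Counter(doc[par_index])[word]
    containing = sum(1 for par in doc if word in par)
    if containing == 0:
        return 0.0
    return tf * math.log(len(doc) / containing)

# Feature used in the model: product of the two related items' tfipf values
# in the paragraph under consideration.
print(tfipf("impala", 0, document) * tfipf("corne", 0, document))
```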
A few other contextual features are included in the model: the distances between pairs of related items, instantiated as:
distance in words between occurrences of the related word types:
minimal distance between two occurrences;
maximal distance between two occurrences;
average distance;
boolean features indicating whether neighbour₁ and neighbour₂ appear in the same sentence or in the same paragraph.
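A sketch of these positional features on a toy tokenised text; the representation of sentences and paragraph boundaries is illustrative.

```python
# Toy tokenised text: a list of sentences, each a list of lemmas; paragraph
# membership is given per sentence (illustrative format).
sentences = [
    ["impala", "vivre", "savane"],
    ["herbe", "abonder", "savane"],
    ["impala", "boire", "eau"],
]
paragraph_of_sentence = [0, 0, 1]   # sentences 0-1 in paragraph 0, sentence 2 in 1

def positions(word):
    """Global token positions of a lemma, paired with its sentence index."""
    pos, offset = [], 0
    for s_idx, sent in enumerate(sentences):
        for t_idx, tok in enumerate(sent):
            if tok == word:
                pos.append((offset + t_idx, s_idx))
        offset += len(sent)
    return pos

def pair_features(w1, w2):
    p1, p2 = positions(w1), positions(w2)
    dists = [abs(a - b) for a, _ in p1 for b, _ in p2]
    return {
        "dist_min": min(dists),
        "dist_max": max(dists),
        "dist_avg": sum(dists) / len(dists),
        "same_sentence": any(s1 == s2 for _, s1 in p1 for _, s2 in p2),
        "same_paragraph": any(paragraph_of_sentence[s1] == paragraph_of_sentence[s2]
                              for _, s1 in p1 for _, s2 in p2),
    }

print(pair_features("impala", "savane"))
```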
Finally, we took into account the network of related lexical items, by considering the largest sets of words present in the text and connected in the database (self-connected components), by adding the following features:
the degree of each lemma, seen as a node in this similarity graph, combined as above into the minimal degree of the pair, the maximal degree, and the product of degrees. The degree is the number of pairs (present in the text) in which a lemma appears.
a boolean feature indicating whether a lexical pair belongs to a connected component of the text other than the largest one. This reflects the fact that a small component may concern a lexical field which is more specific and thus more relevant to the text.
Figure 4 shows examples of self-connected components in an excerpt of the page on Gorille (gorilla), e.g. the set {pelage, dos, fourrure} (coat, back, fur).
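A sketch of how such components and the associated features can be computed from the pairs present in a text, here with networkx (assumed available; a hand-rolled union-find would work equally well):

```python
import networkx as nx

# Lexical pairs from the thesaurus whose members both occur in the text
# (toy data inspired by the Gorille example).
pairs_in_text = [
    ("pelage", "dos"), ("dos", "fourrure"), ("pelage", "fourrure"),
    ("poids", "taille"),
]

graph = nx.Graph(pairs_in_text)
components = sorted(nx.connected_components(graph), key=len, reverse=True)
largest = components[0]

for w1, w2 in pairs_in_text:
    features = {
        "degree_min": min(graph.degree[w1], graph.degree[w2]),
        "degree_max": max(graph.degree[w1], graph.degree[w2]),
        "degree_product": graph.degree[w1] * graph.degree[w2],
        # Boolean: the pair belongs to a component other than the largest one.
        "small_component": w1 not in largest,
    }
    print(w1, w2, features)
```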
The last feature is probably not entirely independent of the productivity of an item, or of the tfipf measure.
Table 2 sums up the features used in our model.
Our task is to identify relevant similarities between lexical items among all possible related pairs, and we want to train an inductive model, a classifier, to extract the relevant links. We have seen that the relevant/not relevant classification is very imbalanced, biased towards the “not relevant” category (about 11%/89%), so we applied methods designed to counterbalance this, and we focus on the precision and recall of the predicted relevant links.
Following a classical methodology, we used 10-fold cross-validation to robustly evaluate the performance of the classifiers. We tested a few popular machine learning methods, and report on two of them: a naive Bayes model and the best method on our dataset, the Random Forest classifier []. Other popular methods (maximum entropy, SVM) showed slightly inferior combined F-scores, even though precision and recall may vary more substantially. As a baseline, we can also consider a simple threshold on the lexical similarity score, in our case Lin's measure, which we have shown to yield the best F-score of 24% when set at 0.22.
To address class imbalance, two broad types of methods can be applied to help the model focus on the minority class. The first is to resample the training data to balance the two classes; the second is to penalize the two classes differently during training when the model makes a mistake (a mistake on the minority class being made more costly than one on the majority class). We tested both strategies, by applying the classical Smote method of [] as a form of resampling, and the ensemble method MetaCost of [] as a cost-aware learning method. Smote synthesizes and adds new instances similar to the minority class instances, and is more effective than mere resampling. MetaCost is an interesting meta-learner that can use any classifier as a base classifier. We used Weka's implementations of these methods [], so our experiments and comparisons are easily replicated on our dataset, provided with this paper, even though they could be improved by refinements of these techniques. We chose the following settings for the different models: naive Bayes uses a kernel density estimation for numerical features, as this generally improves performance. For Random Forests, we chose to have ten trees, and each decision is taken on a randomly chosen set of five features. For resampling, Smote's authors advise doubling the number of instances of the minority class, and we observed that a larger resampling rate degrades performance. For cost-aware learning, a sensible choice is to invert the class ratio for the cost ratio, i.e. here the cost of a mistake on a relevant link (false negative) is exactly 8.5 times higher than the cost on a non-relevant link (false positive), as non-relevant instances are 8.5 times more frequent than relevant ones.
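The original experiments used Weka; as an illustration only, a roughly equivalent cost-aware Random Forest setup in scikit-learn could look like the following sketch (toy data generated on the fly; the real features are those of table 2):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Toy imbalanced dataset standing in for the 9885 annotated pairs
# (about 11% positive instances).
X, y = make_classification(n_samples=2000, n_features=15,
                           weights=[0.89, 0.11], random_state=0)

# Cost-aware variant: mistakes on the minority (relevant) class are weighted
# 8.5 times more, mirroring the inverse class ratio used with MetaCost.
clf = RandomForestClassifier(n_estimators=10, max_features=5,
                             class_weight={0: 1.0, 1: 8.5}, random_state=0)

scores = cross_validate(clf, X, y, cv=10, scoring=("precision", "recall", "f1"))
print("P = %.3f, R = %.3f, F = %.3f" % (scores["test_precision"].mean(),
                                        scores["test_recall"].mean(),
                                        scores["test_f1"].mean()))
```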
We are interested in the precision and recall for the “relevant” class. If we take the best simple classifier (random forests), the precision and recall are 68.1% and 24.2%, for an F-score of 35.7%, and this is significantly beaten by the naive Bayes method, whose precision and recall are more even (F-score of 41.5%). This is already a big improvement over the use of the similarity measure alone (24%). Also note that predicting every link as relevant would result in a 2.6% precision, and thus a 5% F-score. The random forest model is significantly improved by the balancing techniques: the overall best F-score of 46.3% is reached with Random Forests and the cost-aware learning method. Table 3 sums up the scores for the different configurations, with precision, recall, F-score and the confidence interval on the F-score. We analysed the learning curve by doing cross-validations on reduced sets of instances (from 10% to 90%); F1-scores range from 37.3% with 10% of the instances and stabilize at 80%, with small increments in every case.
| Method | Precision | Recall | F-score | CI |
|---|---|---|---|---|
| Baseline (Lin threshold) | 24.0 | 24.0 | 24.0 | |
| RF | 68.1 | 24.2 | 35.7 | 3.4 |
| NB | 34.8 | 51.3 | 41.5 | 2.6 |
| RF + resampling | 56.6 | 32.0 | 40.9 | 3.3 |
| NB + resampling | 32.8 | 54.0 | 40.7 | 2.5 |
| RF + cost-aware learning | 40.4 | 54.3 | 46.3 | 2.7 |
| NB + cost-aware learning | 27.3 | 61.5 | 37.8 | 2.2 |
| Features | Prec. | Recall | F-score |
|---|---|---|---|
| all | 40.4 | 54.3 | 46.3 |
| all − corpus feat. | 37.4 | 52.8 | 43.8 |
| all − similarity feat. | 36.1 | 49.5 | 41.8 |
| all − contextual feat. | 36.5 | 54.8 | 43.8 |
The filtering approach we propose seems to yield good results, by augmenting the similarity built on the whole corpus with signals from the local contexts and documents where related lexical items appear together.
To analyse the role of each set of features, we repeated the experiment while removing one group of features at a time during training; results are shown in table 4 for the best method (RF with cost-aware learning).
We can see that similarity-related features (measures, ranks) have the biggest impact, but the other ones also seem to play a significant role. We can draw the tentative conclusion that the quality of distributional relations depends on the contextualizing of the related lexical items, beyond just the similarity score and the ranks of items as neighbours of other items.
Our work is related to two issues: evaluating distributional resources, and improving them. Evaluating distributional resources has been the subject of a lot of methodological reflection [], and as we said in the introduction, evaluations can be divided between extrinsic and intrinsic ones. In extrinsic evaluations, models are evaluated against benchmarks focusing on a single task or a single aspect of a resource: discriminative, TOEFL-like tests [], analogy production [], or synonym selection []. In intrinsic evaluations, association norms are used, such as the 353 word-similarity dataset [], e.g. [], or specifically designed test cases, as in []. We differ from all these evaluation procedures in that we do not focus on an essential view of the relatedness of two lexical items, but evaluate the link in a context where the relevance of the link is in question, an “existential” view of semantic relatedness.
As for improving distributional thesauri, aside from the numerous alternative approaches to their construction, there is a body of work focusing on improving an existing resource, for instance by reweighting context features once an initial thesaurus is built [], or by post-processing the resource to filter bad neighbours or re-rank the neighbours of a given target []. These still use “essential” evaluation measures (mostly synonym extraction), although the latter comes close to our work since it also trains a model to detect (intrinsically) bad neighbours by using example sentences containing the words to discriminate. We are not aware of any work that tries to evaluate semantic neighbours differently according to the context they appear in.
We proposed a method to reliably evaluate distributional semantic similarity in a broad sense by considering the validation of lexical pairs in contexts where they both appear. This helps cover non classical semantic relations, which are hard to evaluate with classical resources. We also presented a supervised learning model which combines global features from the corpus used to build a distributional thesaurus with local features from the text in which similarities are to be judged as relevant or not to the coherence of the document. It seems from these experiments that the quality of distributional relations depends on the contextualizing of the related lexical items, beyond just the similarity score and the ranks of items as neighbours of other items. This can hopefully help filter lexical pairs when lexical similarity is used as an information source in tasks where context is important: lexical disambiguation [], topic segmentation []. It can also serve as a preprocessing step when looking for similarities at higher levels, for instance at the sentence level [] or at other macro-textual levels [], since these are always aggregation functions of word similarities. There are limits to what is presented here: we need to evaluate the importance of the level of noise in the distributional neighbour database, or at least the quantity of non-semantic relations it contains, and this depends on the way the database is built. Our starting corpus is relatively small compared to current efforts in this framework. We are confident that the same methodology can be followed, even though the quantitative results may vary, since it is independent of the particular distributional thesaurus we used, and of the way the similarities are computed.