Learning Word Sense Distributions, Detecting Unattested Senses and Identifying Novel Senses Using Topic Models

Jey Han Lau,

{}^{\spadesuit}

Paul Cook,

{}^{\heartsuit}

Diana McCarthy,

{}^{\diamondsuit}

Spandana Gella,

{}^{\heartsuit}

Timothy Baldwin

{}^{\heartsuit}

\spadesuit

Dept of Philosophy, King’s College London

\heartsuit

Dept of Computing and Information Systems, The University of Melbourne

\diamondsuit

University of Cambridge
\smaller[1]jeyhan.lau@gmail.com, \smaller[1]paulcook@unimelb.edu.au,
\smaller[1]diana@dianamccarthy.co.uk, \smaller[1]spandanagella@gmail.com, \smaller[1]tb@ldwin.net

Abstract

Unsupervised word sense disambiguation (wsd) methods are an attractive approach to all-words wsddue to their non-reliance on expensive annotated data. Unsupervised estimates of sense frequency have been shown to be very useful for wsddue to the skewed nature of word sense distributions. This paper presents a fully unsupervised topic modelling-based approach to sense frequency estimation, which is highly portable to different corpora and sense inventories, in being applicable to any part of speech, and not requiring a hierarchical sense inventory, parsing or parallel text. We demonstrate the effectiveness of the method over the tasks of predominant sense learning and sense distribution acquisition, and also the novel tasks of detecting senses which aren’t attested in the corpus, and identifying novel senses in the corpus which aren’t captured in the sense inventory.

1 Introduction

The automatic determination of word sense information has been a long-term pursuit of the NLP community []. Word sense distributions tend to be Zipfian, and as such, a simple but surprisingly high-accuracy back-off heuristic for word sense disambiguation (wsd) is to tag each instance of a given word with its predominant sense []. Such an approach requires knowledge of predominant senses; however, word sense distributions — and predominant senses too — vary from corpus to corpus. Therefore, methods for automatically learning predominant senses and sense distributions for specific corpora are required [12].

In this paper, we propose a method which uses topic models to estimate word sense distributions. This method is in principle applicable to all parts of speech, and moreover does not require a parser, a hierarchical sense representation or parallel text. Topic models have been used for wsdin a number of studies [1, 13, 19, 3, 11], but our work extends significantly on this earlier work in focusing on the acquisition of prior word sense distributions (and predominant senses).

Because of domain differences and the skewed nature of word sense distributions, it is often the case that some senses in a sense inventory will not be attested in a given corpus. A system capable of automatically finding such senses could reduce ambiguity, particularly in domain adaptation settings, while retaining rare but nevertheless viable senses. We further propose a method for applying our sense distribution acquisition system to the task of finding unattested senses — i.e., senses that are in the sense inventory but not attested in a given corpus. In contrast to the previous work of McCarthy et al. (2004) on this topic which uses the sense ranking score from to remove low-frequency senses from \smallerWordNet, we focus on finding senses that are unattested in the corpus on the premise that, given accurate disambiguation, rare senses in a corpus contribute to correct interpretation.

Corpus instances of a word can also correspond to senses that are not present in a given sense inventory. This can be due to, for example, words taking on new meanings over time (e.g. the relatively recent senses of tablet and swipe related to touchscreen computers) or domain-specific terms not being included in a more general-purpose sense inventory. A system for automatically identifying such novel senses — i.e. senses that are attested in the corpus but not in the sense inventory — would be a very valuable lexicographical tool for keeping sense inventories up-to-date []. We further propose an application of our proposed method to the identification of such novel senses. In contrast to , the use of topic models makes this possible, using topics as a proxy for sense []. Earlier work on identifying novel senses focused on individual tokens [7], whereas our approach goes further in identifying groups of tokens exhibiting the same novel sense.

2 Background and Related Work

There has been a considerable amount of research on representing word senses and disambiguating usages of words in context (wsd) as, in order to produce computational systems that understand and produce natural language, it is essential to have a means of representing and disambiguating word sense. wsdalgorithms require word sense information to disambiguate token instances of a given ambiguous word, e.g. in the form of sense definitions [], semantic relationships [17] or annotated data [20]. One extremely useful piece of information is the word sense prior or expected word sense frequency distribution. This is important because word sense distributions are typically skewed [], and systems do far better when they take bias into account [].

Typically, word frequency distributions are estimated with respect to a sense-tagged corpus such as SemCor [15], a 220,000 word corpus tagged with \smallerWordNet [] senses. Due to the expense of hand tagging, and sense distributions being sensitive to domain and genre, there has been some work on trying to estimate sense frequency information automatically [5, 16, 6]. Much of this work has been focused on ranking word senses to find the predominant sense in a given corpus [16], which is a very powerful heuristic approach to wsd. Most wsdsystems rely upon this heuristic for back-off in the absence of strong contextual evidence []. proposed a method which relies on distributionally similar words (nearest neighbours) associated with the target word in an automatically acquired thesaurus []. The distributional similarity scores of the nearest neighbours are associated with the respective target word senses using a \smallerWordNetsimilarity measure, such as those proposed by and . The word senses are ranked based on these similarity scores, and the most frequent sense is selected for the corpus that the distributional similarity thesaurus was trained over.

As well as sense ranking for predominant sense acquisition, automatic estimates of sense frequency distribution can be very useful for wsdfor training data sampling purposes [], entropy estimation [], and prior probability estimates, all of which can be integrated within a wsdsystem [5, 6, 12]. Various approaches have been adopted, such as normalizing sense ranking scores to obtain a probability distribution [], using subcategorisation information as an indication of verb sense [12] or alternatively using parallel text [5, 6].

The work of Boyd-Graber and Blei (2007) is highly related in that it extends the method of to provide a generative model which assumes the words in a given document are generated according to the topic distribution appropriate for that document. They then predict the most likely sense for each word in the document based on the topic distribution and the words in context (“corroborators”), each of which, in turn, depends on the document’s topic distribution. Using this approach, they get comparable results to McCarthy et al. when context is ignored (i.e. using a model with one topic), and at most a 1% improvement on SemCor when they use more topics in order to take context into account. Since the results do not improve on McCarthy et al. as regards sense distribution acquisition irrespective of context, we will compare our model with that proposed by McCarthy et al.

Recent work on finding novel senses has tended to focus on comparing diachronic corpora [] and has also considered topic models []. In a similar vein, Peirsman et al. (2010) considered the identification of words having a sense particular to one language variety with respect to another (specifically Belgian and Netherlandic Dutch). In contrast to these studies, we propose a model for comparing a corpus with a sense inventory. Carpuat et al. (2013) exploit parallel corpora to identify words in domain-specific monolingual corpora with previously-unseen translations; the method we propose does not require parallel data.

3 Methodology

Our methodology is based on the WSI system described in ,¹¹Based on the implementation available at: https://github.com/jhlau/hdp-wsi which has been shown [] to achieve state-of-the-art results over the WSI tasks from SemEval-2007 [], SemEval-2010 [] and SemEval-2013 []. The system is built around a Hierarchical Dirichlet Process (HDP: ), a non-parametric variant of a Latent Dirichlet Allocation topic model [] where the model automatically optimises the number of topics in a fully-unsupervised fashion over the training data.

To learn the senses of a target lemma, we train a single topic model per target lemma. The system reads in a collection of usages of that lemma, and automatically induces topics (= senses) in the form of a multinomial distribution over words, and per-usage topic assignments (= probabilistic sense assignments) in the form of a multinomial distribution over topics. Following , we assign one topic to each usage by selecting the topic that has the highest cumulative probability density, based on the topic allocations of all words in the context window for that usage.²²This includes all words in the usage sentence except stopwords, which were filtered in the preprocessing step. Note that in their original work, experimented with the use of features extracted from a dependency parser. Due to the computational overhead associated with these features, and the fact that the empirical impact of the features was found to be marginal, we make no use of parser-based features in this paper.³³For hyper-parameters $\alpha$ and $\gamma$ , we used 0.1 for both. We did not tune the parameters, and opted to use the default parameters introduced in .

The induced topics take the form of word multinomials, and are often represented by the top- $N$ words in descending order of conditional probability. We interpret each topic as a sense of the target lemma.⁴⁴To avoid confusion, we will refer to the HDP-induced topics as topics, and reserve the term sense to denote senses in a sense inventory. To illustrate this, we give the example of topics induced by the HDP model for network in Table 1.

We refer to this method as HDP-WSIhenceforth.⁵⁵The code used to learn predominant sense and run all experiments described in this paper is available at: https://github.com/jhlau/predom_sense.

Topic Num	Top-10 Terms
1	network support @card@ information research service group development community member
2	service @card@ road company transport rail area government network public
3	network social model system family structure analysis form relationship neural
4	network @card@ computer system service user access internet datum server
5	system network management software support corp company service application product
6	@card@ radio news television show bbc programme call think film
7	police drug criminal terrorist intelligence network vodafone iraq attack cell
8	network atm manager performance craigavon group conference working modelling assistant
9	root panos comenius etd unipalm lse brazil telephone xxx discuss

Table 1: An example to illustrate the topics induced for network by the HDP model. The top-10 highest probability terms are displayed to represent each topic (@card@ denotes a tokenised cardinal number).

In predominant sense acquisition, the task is to learn, for each target lemma, the most frequently occurring word sense in a particular domain or corpus, relative to a predefined sense inventory. The WSI system provides us with a topic allocation per usage of a given word, from which we can derive a distribution of topics over usages and a predominant topic. In order to map this onto the predominant sense, we need to have some way of aligning a topic with a sense. We design our topic–sense alignment methodology with portability in mind — it should be applicable to any sense inventory. As such, our alignment methodology assumes only that we have access to a conventional sense gloss or definition for each sense, and does not rely on ontological/structural knowledge (e.g. the \smallerWordNethierarchy).

To compute the similarity between a sense and a topic, we first convert the words in the gloss/definition into a multinomial distribution over words, based on simple maximum likelihood estimation.⁶⁶Words are tokenised using OpenNLP and lemmatised with Morpha []. We additionally remove the target lemma, stopwords and words that are less than 3 characters in length. We then calculate the Jensen–Shannon divergence between the multinomial distribution (over words) of the gloss and that of the topic, and convert the divergence value into a similarity score by subtracting it from $1$ . Formally, the similarity sense $s_{i}$ and topic $t_{j}$ is:

\displaystyle\text{sim}(s_{i},t_{j})

\displaystyle=1-\text{JS}(S\|T)

(1)

where $S$ and $T$ are the multinomial distributions over words for sense $s_{i}$ and topic $t_{j}$ , respectively, and $\text{JS}(X\|Y)$ is the Jensen–Shannon divergence for distribution $X$ and $Y$ .

To learn the predominant sense, we compute the prevalence score of each sense and take the sense with the highest prevalence score as the predominant sense. The prevalence score for a sense is computed by summing the product of its similarity scores with each topic (i.e. $\text{sim}(s_{i},t_{j})$ ) and the prior probability of the topic in question (based on maximum likelihood estimation). Formally, the prevalence score of sense $s_{i}$ is given as follows:

	$\displaystyle\text{prevalence}(s_{i})$	$\displaystyle=\sum_{j}^{T}\left(\text{sim}(s_{i},t_{j})\times P(t_{j})\right)$		(2)
		$\displaystyle=\sum_{j}^{T}\left(\text{sim}(s_{i},t_{j})\times\frac{f(t_{j})}{% \sum_{k}^{T}f(t_{k})}\right)$

where $f(t_{j})$ is the frequency of topic $t_{j}$ (i.e. the number of usages assigned to topic $t_{j}$ ), and $T$ is the number of topics.

The intuition behind the approach is that the predominant sense should be the sense that has relatively high similarity (in terms of lexical overlap) with high-probability topic(s).

4 \smallerWordNetExperiments

We first test the proposed method over the tasks of predominant sense learning and sense distribution induction, using the \smallerWordNet-tagged dataset of , which is made up of 3 collections of documents: a domain-neutral corpus (BNC), and two domain-specific corpora (SPORTSand FINANCE). For each domain, annotators were asked to sense-annotate a random selection of sentences for each of 40 target nouns, based on \smallerWordNetv1.7. The predominant sense and distribution across senses for each target lemma was obtained by aggregating over the sense annotations. The authors evaluated their method in terms of wsdaccuracy over a given corpus, based on assigning all instances of a target word with the predominant sense learned from that corpus. For the remainder of the paper, we denote their system as MKWC.

To compare our system (HDP-WSI) with MKWC, we apply it to the three datasets of . For each dataset, we use HDP to induce topics for each target lemma, compute the similarity between the topics and the \smallerWordNetsenses (Equation (1)), and rank the senses based on the prevalence scores (Equation (2)). In addition to the wsdaccuracy based on the predominant sense inferred from a particular corpus, we additionally compute: (1) $\text{Acc}_{\text{$\text{\sc ub}$}}$ , the upper bound for the first sense-based wsdaccuracy (using the gold standard predominant sense for disambiguation);⁷⁷The upper bound for a wsdapproach which tags all token occurrences of a given word with the same sense, as a first step towards context-sensitive unsupervised wsd. and (2) ERR, the error rate reduction between the accuracy for a given system ( $\text{Acc}_{\text{}}$ ) and the upper bound ( $\text{Acc}_{\text{$\text{\sc ub}$}}$ ), calculated as follows:

\displaystyle\text{ERR}=1-\frac{\text{Acc}_{\text{$\text{\sc ub}$}}-\text{Acc}% _{\text{}}}{\text{Acc}_{\text{$\text{\sc ub}$}}}

Looking at the results in Table 2, we see little difference in the results for the two methods, with MKWCperforming better over two of the datasets (BNCand SPORTS) and HDP-WSIperforming better over the third (FINANCE), but all differences are small. Based on the McNemar’s Test with Yates correction for continuity, MKWCis significantly better over BNCand HDP-WSIis significantly better over FINANCE( $p<0.0001$ in both cases), but the difference over SPORTSis not statistically significance ( $p>0.1$ ). Note that there is still much room for improvement with both systems, as we see in the gap between the upper bound (based on perfect determination of the first sense) and the respective system accuracies.

Given that both systems compute a continuous-valued prevalence score for each sense of a target lemma, a distribution of senses can be obtained by normalising the prevalence scores across all senses. The predominant sense learning task of evaluates the ability of a method to identify only the head of this distribution, but it is also important to evaluate the full sense distribution []. To this end, we introduce a second evaluation metric: the Jensen–Shannon (JS) divergence between the inferred sense distribution and the gold-standard sense distribution, noting that smaller values are better in this case, and that it is now theoretically possible to obtain a JS divergence of 0 in the case of a perfect estimate of the sense distribution. Results are presented in Table 3.

HDP-WSIconsistently achieves lower JS divergence, indicating that the distribution of senses that it finds is closer to the gold standard distribution. Testing for statistical significance over the paired JS divergence values for each lemma using the Wilcoxon signed-rank test, the result for FINANCEis significant ( $p<0.05$ ) but the results for the other two datasets are not ( $p>0.1$ in each case).

\smaller

Dataset	FS ${}_{\text{\sc corpus}}$	MKWC		HDP-WSI
Dataset	$\text{Acc}_{\text{$\text{\sc ub}$}}$	$\text{Acc}_{\text{}}$	ERR	$\text{Acc}_{\text{}}$	ERR
BNC	0.524	0.407	\smaller(0.777)	0.376	\smaller(0.718)
FINANCE	0.801	0.499	\smaller(0.623)	0.555	\smaller(0.693)
SPORTS	0.774	0.437	\smaller(0.565)	0.422	\smaller(0.545)

Table 2: wsdaccuracy for MKWCand HDP-WSIon the \smallerWordNet-annotated datasets, as compared to the upper-bound based on actual first sense in the corpus (higher values indicate better performance; the best system in each row [other than the FS

{}_{\text{\sc corpus}}

upper bound] is indicated in boldface).

\smaller

Dataset	MKWC	HDP-WSI
BNC	0.226	0.214
FINANCE	0.426	0.375
SPORTS	0.420	0.363

Table 3: Sense distribution evaluation of MKWCand HDP-WSIon the \smallerWordNet-annotated datasets, evaluated using JS divergence (lower values indicate better performance; the best system in each row is indicated in boldface).

\smaller

Dataset	FS ${}_{\text{\sc corpus}}$	FS ${}_{\text{\sc dict}}$		HDP-WSI
Dataset	$\text{Acc}_{\text{$\text{\sc ub}$}}$	$\text{Acc}_{\text{}}$	ERR	$\text{Acc}_{\text{}}$	ERR
ukWaC	0.574	0.387	\smaller(0.674)	0.514	\smaller(0.895)
Twitter	0.468	0.297	\smaller(0.635)	0.335	\smaller(0.716)

Table 4: wsdaccuracy for HDP-WSIon the \smallerMacmillan-annotated datasets, as compared to the upper-bound based on actual first sense in the corpus (higher values indicate better performance; the best system in each row [other than the FS

{}_{\text{\sc corpus}}

upper bound] is indicated in boldface).

\smaller

Dataset	FS ${}_{\text{\sc corpus}}$	FS ${}_{\text{\sc dict}}$	HDP-WSI
ukWaC	0.210	0.393	0.156
Twitter	0.259	0.472	0.171

Table 5: Sense distribution evaluation of HDP-WSIon the \smallerMacmillan-annotated datasets as compared to corpus- and dictionary-based first sense methods, evaluated using JS divergence (lower values indicate better performance; the best system in each row is indicated in boldface).

To summarise, the results for MKWCand HDP-WSIare fairly even for predominant sense learning (each outperforms the other at a level of statistical significance over one dataset), but HDP-WSIis better at inducing the overall sense distribution.

It is important to bear in mind that MKWCin these experiments makes use of full-text parsing in calculating the distributional similarity thesaurus, and the \smallerWordNetgraph structure in calculating the similarity between associated words and different senses. Our method, on the other hand, uses no parsing, and only the synset definitions (and not the graph structure) of \smallerWordNet.⁸⁸ obtained good results with definition overlap, but their implementation uses the relation structure alongside the definitions []. Iida et al. (2008) demonstrate that further extensions using distributional data are required when applying the method to resources without hierarchical relations. The non-reliance on parsing is significant in terms of portability to text sources which are less amenable to parsing (such as Twitter: []), and the non-reliance on the graph structure of \smallerWordNetis significant in terms of portability to conventional “flat” sense inventories. While comparable results on a different dataset have been achieved with a proximity thesaurus [] compared to a dependency one,⁹⁹The thesauri used in the reimplementation of MKWCin this paper were obtained from http://webdocs.cs.ualberta.ca/~lindek/downloads.htm. it is not stated how wide a window is needed for the proximity thesaurus. This could be a significant issue with Twitter data, where context tends to be limited. In the next section, we demonstrate the robustness of the method in experimenting with two new datasets, based on Twitter and a web corpus, and the \smallerMacmillan English Dictionary.

5 \smallerMacmillanExperiments

In our second set of experiments, we move to a new dataset [9] based on text from ukWaC [8] and Twitter, and annotated using the \smallerMacmillan English Dictionary¹⁰¹⁰http://www.macmillandictionary.com/ (henceforth “\smallerMacmillan”). For the purposes of this research, the choice of \smallerMacmillanis significant in that it is a conventional dictionary with sense definitions and examples, but no linking between senses.¹¹¹¹Strictly speaking, there is limited linking in the form of sets of synonyms in \smallerMacmillan, but we choose to not use this information in our research. In terms of the original research which gave rise to the sense-tagged dataset, \smallerMacmillanwas chosen over \smallerWordNetfor reasons including: (1) the well-documented difficulties of sense tagging with fine-grained \smallerWordNetsenses []; (2) the regular update cycle of \smallerMacmillan(meaning it contains many recently-emerged senses); and (3) the finding in a preliminary sense-tagging task that it better captured Twitter usages than \smallerWordNet(and also \smallerOntoNotes: ).

The dataset is made up of 20 target nouns which were selected to span the high- to mid-frequency range in both Twitter and the ukWaC corpus, and have at least 3 \smallerMacmillansenses. The average sense ambiguity of the 20 target nouns in \smallerMacmillanis 5.6 (but 12.3 in \smallerWordNet). 100 usages of each target noun were sampled from each of Twitter (from a crawl over the time period Jan 3–Feb 28, 2013 using the Twitter Streaming API) and ukWaC, after language identification using langid.py[] and POS tagging (based on the CMU ARK Twitter POS tagger v2.0 [] for Twitter, and the POS tags provided with the corpus for ukWaC). Amazon Mechanical Turk (AMT) was then used to 5-way sense-tag each usage relative to \smallerMacmillan, including allowing the annotators the option to label a usage as “Other” in instances where the usage was not captured by any of the \smallerMacmillansenses. After quality control over the annotators/annotations (see Gella et al. (to appear) for details), and aggregation of the annotations into a single sense per usage (possibly “Other”), there were 2000 sense-tagged ukWaC sentences and Twitter messages over the 20 target nouns. We refer to these two datasets as ukWaCand Twitterhenceforth.

To apply our method to the two datasets, we use HDP-WSIto train a model for each target noun, based on the combined set of usages of that lemma in each of the two background corpora, namely the original Twitter crawl that gave rise to the Twitterdataset, and all of ukWaC.

5.1 Learning Sense Distributions

As in Section 4, we evaluate in terms of wsdaccuracy (Table 4) and JS divergence over the gold-standard sense distribution (Table 5). We also present the results for: (a) a supervised baseline (“FS ${}_{\text{\sc corpus}}$ ”), based on the most frequent sense in the corpus; and (b) an unsupervised baseline (“FS ${}_{\text{\sc dict}}$ ”), based on the first-listed sense in \smallerMacmillan. In each case, the sense distribution is based on allocating all probability mass for a given word to the single sense identified by the respective method.

We first notice that, despite the coarser-grained senses of \smallerMacmillanas compared to \smallerWordNet, the upper bound wsdaccuracy using \smallerMacmillanis comparable to that of the \smallerWordNet-based datasets over the balanced BNC, and quite a bit lower than that of the two domain corpora of . This suggests that both datasets are diverse in domain and content.

In terms of wsdaccuracy, the results over ukWaC(ERR= 0.895) are substantially higher than those for BNC, while those over Twitter(ERR= 0.716) are comparable. The accuracy is significantly higher than the dictionary-based first sense baseline (FS ${}_{\text{\sc dict}}$ ) over both datasets (McNemar’s test; $p<0.0001$ ), and the ERRis also considerably higher than for the two domain datasets in Section 4 (FINANCEand SPORTS). One cause of difficulty in sense-modelling Twitteris large numbers of missing senses, with 12.3% of usages in Twitterand 6.6% in ukWaChaving no corresponding \smallerMacmillansense.¹²¹²The relative occurrence of unlisted/unclear senses in the datasets of is comparable to ukWaC. This challenges the assumption built into the sense prevalence calculation that all topics will align to a pre-existing sense, a point we return to in Section 5.2.

The JS divergence results for both datasets are well below (= better than) the results for all three \smallerWordNet-based datasets, and also superior to both the supervised and unsupervised first-sense baselines. Part of the reason for this improvement is simply that the average polysemy in \smallerMacmillan(5.6 senses per target lemma) is slightly less than in \smallerWordNet(6.7 senses per target lemma),¹³¹³Note that the set of lemmas differs between the respective datasets, so this isn’t an accurate reflection of the relative granularity of the two dictionaries. making the task slightly easier in the \smallerMacmillancase.

\smaller

Dataset	$P$	$R$	$F$
ukWaC	0.73	0.85	0.74
Twitter	0.56	0.88	0.65

Table 6: Evaluation of our method for identifying unattested senses, averaged over 10 runs of 10-fold cross validation

5.2 Identification of Unattested Senses

We observed in Section 5.1 that there are relatively frequent occurrences of usages (e.g. 12.3% for Twitter) which aren’t captured by \smallerMacmillan. Conversely, there are also senses in \smallerMacmillanwhich aren’t attested in the annotated sample of usages. Specifically, of the 112 senses defined for the 20 target lemmas, 25 (= 22.3%) of the senses are not attested in the 2000 usages in either corpora. Given that our methodology computes a prevalence score for each sense, it can equally be applied to the detection of these unattested senses, and it is this task that we address in this section: the identification of senses that are defined in the sense inventory but not attested in a given corpus.

Intuitively, an unused sense should have low similarity with the HDP induced topics. As such, we introduce sense-to-topic affinity, a measure that estimates how likely a sense is not attested in the corpus:

\text{st-affinity}(s_{i})=\frac{\sum^{T}_{j}\text{sim}(s_{i},t_{j})}{\sum^{S}_% {k}\sum^{T}_{l}\text{sim}(s_{k},t_{l})}

(3)

where $\text{sim}(s_{i},t_{j})$ is carried over from Equation (1), and $T$ and $S$ represent the number of topics and senses, respectively.

We treat the task of identification of unused senses as a binary classification problem, where the goal is to find a sense-to-topic affinity threshold below which a sense will be considered to be unused. We pool together all the senses and run 10-fold cross validation to learn the threshold for identifying unused senses,¹⁴¹⁴We used a fixed step and increment at steps of 0.001, up to the max value of st-affinity when optimising the threshold. evaluated using sense-level precision ( $P$ ), recall ( $R$ ) and F-score ( $F$ ) at detecting unattested senses. We repeat the experiment 10 times (partitioning the items randomly into folds) and collect the mean precision, recall and F-scores across the 10 runs. We found encouraging results for the task, as detailed in Table 6. For the threshold, the average value with standard deviation is $0.092\pm 0.044$ over ukWaCand $0.125{\pm}0.052$ over Twitter, indicating relative stability in the value of the threshold both internally within a dataset, and also across datasets.

5.3 Identification of Novel Senses

No. Lemmas with	Relative Freq	Threshold	P	R	F
a Removed Sense	of Removed Sense	Mean $\pm$ stdev	P	R	F
20	0.0–0.2	0.052 $\pm$ 0.009	0.35	0.42	0.36
9	0.2–0.4	0.089 $\pm$ 0.024	0.24	0.59	0.29
6	0.4–0.6	0.061 $\pm$ 0.004	0.63	0.64	0.63

Table 7: Classification of usages with novel sense for all target lemmas.

No. Lemmas with	Relative Freq	Threshold	P	R	F
a Removed Sense	of Removed Sense	Mean $\pm$ stdev	P	R	F
9	0.2–0.4	0.093 $\pm$ 0.023	0.50	0.66	0.52
6	0.4–0.6	0.099 $\pm$ 0.018	0.73	0.90	0.80

Table 8: Classification of usages with novel sense for target lemmas with a removed sense.

No. of Lemmas with	No. of Lemmas without	Relative Freq	Wilcoxon Rank Sum
a Removed Sense	a Removed Sense	of Removed Sense	$p$ -value
10	0	0.0–0.2	0.4543
9	11	0.2–0.4	0.0391
6	14	0.4–0.6	0.0247

Table 9: Wilcoxon Rank Sum

p

-value results for testing target lemmas with removed sense vs. target lemmas without removed sense using novelty.

In both Twitterand ukWaC, we observed frequent occurrences of usages of our target nouns which didn’t map onto a pre-existing \smallerMacmillansense. A natural question to ask is whether our method can be used to predict word senses that are missing from our sense inventory, and identify usages associated with each such missing sense. We will term these “novel senses”, and define “novel sense identification” to be the task of identifying new senses that are not recorded in the inventory but are seen in the corpus.

An immediate complication in evaluating novel sense identification is that we are attempting to identify senses which explicitly aren’t in our sense inventory. This contrasts with the identification of unattested senses, e.g., where we were attempting to identify which of the known senses wasn’t observed in the corpus. Also, while we have annotations of “Other” usages in Twitterand ukWaC, there is no real expectation that all such usages will correspond to the same sense: in practice, they are attributable to a myriad of effects such as incorporation in a non-compositional multiword expression, and errors in POS tagging (i.e. the usage not being nominal). As such, we can’t use the “Other” annotations to evaluate novel sense identification. The evaluation of systems for this task is a known challenge, which we address similarly to Erk (2006) by artificially synthesising novel senses through removal of senses from the sense inventory. In this way, even if we remove multiple senses for a given word, we still have access to information about which usages correspond to which novel sense. An additional advantage of this procedure is that it allows us to control an important property of novel senses: their frequency of occurrence.

In the experiments that follow, we randomly select senses for removal from three frequency bands: low, medium and high frequency senses. Frequency is defined by relative occurrence in the annotated usages: low = 0.0–0.2; medium = 0.2–0.4; and high = 0.4–0.6. Note that we do not consider high-frequency senses with frequency higher than 0.6, as it is rare for a medium- to high-frequency word to take on a novel sense which is then the predominant sense in a given corpus. Note also that not all target lemmas will have a novel sense through synthesis, as they may have no senses that fall within the indicated bounds of relative occurrence (e.g. if $>60\%$ of usages are a single sense). For example, only 6 of our 20 target nouns have senses which are candidates for high-frequency novel senses.

As before, we treat the novel sense identification task as a classification problem, although with a significantly different formulation: we are no longer attempting to identify pre-existing senses, as novel senses are by definition not included in the sense inventory. Instead, we are seeking to identify clusters of usages which are instances of a novel sense, e.g. for presentation to a lexicographer as part of a dictionary update process []. That is, for each usage, we want to classify whether it is an instance of a given novel sense.

A usage that corresponds to a novel sense should have a topic that does not align well with any of the pre-existing senses in the sense inventory. Based on this intuition, we introduce topic-to-sense affinity to estimate the similarity of a topic to the set of senses, as follows:

\text{ts-affinity}(t_{j})=\frac{\sum^{S}_{i}\text{sim}(s_{i},t_{j})}{\sum^{T}_% {l}\sum^{S}_{k}\text{sim}(s_{k},t_{l})}

(4)

where, once again, $\text{sim}(s_{i},t_{j})$ is defined as in Equation (1), and $T$ and $S$ represent the number of topics and senses, respectively.

Using topic-to-sense affinity as the sole feature, we pool together all instances and optimise the affinity feature to classify instances that have novel senses. Evaluation is done by computing the mean precision, recall and F-score across 10 separate runs; results are summarised in Table 7. Note that we evaluate only over ukWaCin this section, for ease of presentation.

The results show that instances with high-frequency novel senses are more easily identifiable than instances with medium/low-frequency novel senses. This is unsurprising given that high-frequency senses have a higher probability of generating related topics (sense-related words are observed more frequently in the corpus), and as such are more easily identifiable.

We are interested in understanding whether pooling all instances — instances from target lemmas that have a sense artificially removed and those that do not — impacted the results (recall that not all target lemmas have a removed sense). To that end, we chose to include only instances from lemmas with a removed sense, and repeated the experiment for the medium- and high-frequency novel sense condition (for the low-frequency condition, all target lemmas have a novel sense). In other words, we are assuming knowledge of which words have novel sense, and the task is to identify specifically what the novel sense is, as represented by novel usages. Results are presented in Table 8.

From the results, we see that the F-scores improved notably. This reveals that an additional step is necessary to determine whether a target lemma has a potential novel sense before feeding its instances to learn which of them contains the usage of the novel sense.

In the last experiment, we propose a new measure to tackle this: the identification of target lemmas that have a novel sense. We introduce novelty, a measure of the likelihood of a target lemma $w$ having a novel sense:

\text{novelty}(w)=\min_{t_{j}}\left(\max_{s_{i}}\frac{\text{sim}(s_{i},t_{j})}% {f(t_{j})}\right)

(5)

where $f(t_{j})$ is the frequency of topic $t_{j}$ in the corpus. The intuition behind novelty is that a target lemma with a novel sense should have a (somewhat-)frequent topic that has low association with any sense. That we use the frequency rather than the probability of the topic here is deliberate, as topics with a higher raw number of occurrences (whether as a low-probability topic for a high-frequency word, or a high-probability topic for a low-frequency word) are indicative of a novel word sense.

For each of our three datasets (with low-, medium- and high-frequency novel senses, respectively), we compute the novelty of the target lemmas and the $p$ -value of a one-tailed Wilcoxon rank sum test to test if the two groups of lemmas (i.e. lemmas with a novel sense vs. lemmas without a novel sense) are statistically different.¹⁵¹⁵Note that the number of words with low-frequency novel senses here is restricted to 10 (cf. 20 in Table 7) to ensure we have both positive and negative lemmas in the dataset. Results are presented in Table 9. We see that the novelty measure can readily identify target lemmas with high- and medium-frequency novel senses ( $p<0.05$ ), but the results are less promising for the low-frequency novel senses.

6 Discussion

Our methodologies for the two proposed tasks of identifying unused and novel senses are simple extensions to demonstrate the flexibility and robustness of our methodology. Future work could pursue a more sophisticated methodology, using non-linear combinations of $\text{sim}(s_{i},t_{j})$ for computing the affinity measures or multiple features in a supervised context. We contend, however, that these extensions are ultimately a preliminary demonstration to the flexibility and robustness of our methodology.

A natural next step for this research would be to couple sense distribution estimation and the detection of unattested senses with evidence from the context, using topics or other information about the local context (e.g. ) to carry out unsupervised wsdof individual token occurrences of a given word.

In summary, we have proposed a topic modelling-based method for estimating word sense distributions, based on Hierarchical Dirichlet Processes and the earlier work of on word sense induction, in probabilistically mapping the automatically-learned topics to senses in a sense inventory. We evaluated the ability of the method to learn predominant senses and induce word sense distributions, based on a broad range of datasets and two separate sense inventories. In doing so, we established that our method is comparable to the approach of at predominant sense learning, and superior at inducing word sense distributions. We further demonstrated the applicability of the method to the novel tasks of detecting word senses which are unattested in a corpus, and identifying novel senses which are found in a corpus but not captured in a word sense inventory.

Acknowledgements

We wish to thank the anonymous reviewers for their valuable comments. This research was supported in part by funding from the Australian Research Council.

References

[1] J. Boyd-Graber, D. Blei and X. Zhu(2007) A topic model for word sense disambiguation. Prague, Czech Republic, pp. 1024–1033. External Links: Link Cited by: 1.
[2] J. Boyd-Graber and D. Blei(2007) PUTOP: turning predominant senses into a topic model for word sense disambiguation. Prague, Czech Republic, pp. 277–281. External Links: Link Cited by: 2.
[3] J. F. Cai, W. S. Lee and Y. W. Teh(2007) NUS-ML: improving word sense disambiguation using topic features. Prague, Czech Republic, pp. 249–252. External Links: Link Cited by: 1.
[4] M. Carpuat, H. Daumé III, K. Henry, A. Irvine, J. Jagarlamudi and R. Rudinger(2013) SenseSpotting: never let your parallel data tie you to an old domain. Sofia, Bulgaria, pp. 1435–1445. Cited by: 2.
[5] Y. S. Chan and H. T. Ng(2005) Word sense disambiguation with distribution estimation. Edinburgh, UK, pp. 1010–1015. Cited by: 2, 2.
[6] Y. S. Chan and H. T. Ng(2006) Estimating class priors in domain adaptation for word sense disambiguation. Sydney, Australia, pp. 89–96. External Links: Link, Document Cited by: 2, 2.
[7] K. Erk(2006) Unknown word sense detection as outlier detection. New York City, USA, pp. 128–135. Cited by: 1, 5.3.
[8] A. Ferraresi, E. Zanchetta, M. Baroni and S. Bernardini(2008) Introducing and evaluating ukWaC, a very large web-derived corpus of English. Marrakech, Morocco, pp. 47–54. Cited by: 5.
[9] S. Gella, P. Cook and T. Baldwin(to appear) One sense per tweeter … and other lexical semantic tales of Twitter. Gothenburg, Sweden. Cited by: 5, 5.
[10] R. Iida, D. McCarthy and R. Koeling(2008) Gloss-based semantic similarity metrics for predominant sense acquisition. pp. 561–568. Cited by: 4.
[11] J. Knopp, J. Völker and S. P. Ponzetto(2013) Topic modeling for word sense induction. Darmstadt, Germany, pp. 97–103. Cited by: 1.
[12] M. Lapata and C. Brew(2004) Verb class disambiguation using informative priors. Computational Linguistics 30 (1), pp. 45–75. Cited by: 1, 2.
[13] L. Li, B. Roth and C. Sporleder(2010) Topic models for word sense disambiguation and token-based idiom detection. Uppsala, Sweden, pp. 1138–1147. External Links: Link Cited by: 1.
[14] D. McCarthy, R. Koeling, J. Weeds and J. Carroll(2004) Automatic identification of infrequent word senses. Geneva, Switzerland, pp. 1220–1226. Cited by: 1.
[15] G. A. Miller, C. Leacock, R. Tengi and R. T. Bunker(1993) A semantic concordance. pp. 303–308. Cited by: 2.
[16] S. Mohammad and G. Hirst(2006) Determining word sense dominance using a thesaurus. Trento, Italy, pp. 121–128. Cited by: 2.
[17] R. Navigli and P. Velardi(2005) Structural semantic interconnections: a knowledge-based approach to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (7), pp. 1075–1088. Cited by: 2.
[18] Y. Peirsman, D. Geeraerts and D. Speelman(2010) The automatic identification of lexical variation between language varieties. Natural Language Engineering 16 (4), pp. 469–491. Cited by: 2.
[19] J. Preiss and M. Stevenson(2013) Unsupervised domain tuning to improve word sense disambiguation. Atlanta, USA, pp. 680–684. External Links: Link Cited by: 1.
[20] Z. Zhong and H. T. Ng(2010) It makes sense: a wide-coverage word sense disambiguation system for free text. Uppsala, Sweden, pp. 78–83. External Links: Link Cited by: 2.

Generated on Tue Jun 10 17:23:06 2014 by LaTeXML [LOGO]