Dependency-Based Word Embeddings

Omer Levy    Yoav Goldberg
Computer Science Department
Bar-Ilan University
Ramat-Gan, Israel
{omerlevy,yoav.goldberg}@gmail.com
(Supported by the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287923 (EXCITEMENT).)
Abstract

While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts. In particular, we perform experiments with dependency-based contexts, and show that they produce markedly different embeddings. The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings.

1 Introduction

Word representation is central to natural language processing. The default approach of representing words as discrete and distinct symbols is insufficient for many tasks, and suffers from poor generalization. For example, the symbolic representations of the words “pizza” and “hamburger” are completely unrelated: even if we know that the word “pizza” is a good argument for the verb “eat”, we cannot infer that “hamburger” is also a good argument. We thus seek a representation that captures semantic and syntactic similarities between words. A very common paradigm for acquiring such representations is based on the distributional hypothesis of Harris [16], which states that words in similar contexts have similar meanings.

Based on the distributional hypothesis, many methods of deriving word representations have been explored in the NLP community. On one end of the spectrum, words are grouped into clusters based on their contexts [5, 32]. On the other end, words are represented as very high-dimensional but sparse vectors in which each entry is a measure of the association between the word and a particular context (see [30, 3] for a comprehensive survey). In some works, the dimensionality of the sparse word-context vectors is reduced, using techniques such as SVD [6] or LDA [25, 27, 8]. Most recently, it has been proposed to represent words as dense vectors derived by various training methods inspired by neural-network language modeling [4, 10, 23, 20, 21]. These representations, referred to as “neural embeddings” or “word embeddings”, have been shown to perform well across a variety of tasks [29, 9, 26, 2].

Word embeddings are easy to work with because they enable efficient computation of word similarities through low-dimensional matrix operations. Among the state-of-the-art word-embedding methods is the skip-gram with negative sampling model (SkipGram), introduced by Mikolov et al. [21] and implemented in the word2vec software (code.google.com/p/word2vec/). Not only does it produce useful word representations, but it is also very efficient to train, works in an online fashion, and scales well to huge corpora (billions of words) as well as to very large word and context vocabularies.

Previous work on neural word embeddings takes the contexts of a word to be its linear context – the words that precede and follow the target word, typically in a window of k tokens to each side. However, other types of contexts can be explored too.

In this work, we generalize the SkipGram model, and move from linear bag-of-words contexts to arbitrary word contexts. Specifically, following work in sparse vector-space models [18, 24, 3], we experiment with syntactic contexts that are derived from automatically produced dependency parse-trees.

The different kinds of contexts produce noticeably different embeddings, and induce different word similarities. In particular, the bag-of-words nature of the contexts in the “original” SkipGram model yields broad topical similarities, while the dependency-based contexts yield more functional similarities of a cohyponym nature. This effect is demonstrated using both qualitative and quantitative analysis (Section 4).

The neural word-embeddings are considered opaque, in the sense that it is hard to assign meanings to the dimensions of the induced representation. In Section 5 we show that the SkipGram model does allow for some introspection by querying it for contexts that are “activated by” a target word. This allows us to peek into the learned representation and explore the contexts that are found by the learning process to be most discriminative of particular words (or groups of words). To the best of our knowledge, this is the first work to suggest such an analysis of discriminatively-trained word-embedding models.

2 The Skip-Gram Model

Our departure point is the skip-gram neural embedding model introduced in [19] trained using the negative-sampling procedure presented in [21]. In this section we summarize the model and training objective following the derivation presented by Goldberg and Levy [13], and highlight the ease of incorporating arbitrary contexts in the model.

In the skip-gram model, each word $w \in W$ is associated with a vector $v_w \in \mathbb{R}^d$ and similarly each context $c \in C$ is represented as a vector $v_c \in \mathbb{R}^d$, where $W$ is the words vocabulary, $C$ is the contexts vocabulary, and $d$ is the embedding dimensionality. The entries in the vectors are latent, and treated as parameters to be learned. Loosely speaking, we seek parameter values (that is, vector representations for both words and contexts) such that the dot product $v_w \cdot v_c$ associated with “good” word-context pairs is maximized.

More specifically, the negative-sampling objective assumes a dataset $D$ of observed $(w,c)$ pairs of words $w$ and the contexts $c$ in which they appeared, extracted from a large body of text. Consider a word-context pair $(w,c)$. Did this pair come from the data? We denote by $p(D=1 \mid w,c)$ the probability that $(w,c)$ came from the data, and by $p(D=0 \mid w,c) = 1 - p(D=1 \mid w,c)$ the probability that it did not. The distribution is modeled as:

$$p(D=1 \mid w,c) = \frac{1}{1 + e^{-v_w \cdot v_c}}$$

where $v_w$ and $v_c$ (each a $d$-dimensional vector) are the model parameters to be learned. We seek to maximize the log-probability of the observed pairs belonging to the data, leading to the objective:

$$\arg\max_{v_w, v_c} \sum_{(w,c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}}$$

This objective admits a trivial solution in which $p(D=1 \mid w,c) = 1$ for every pair $(w,c)$. This can be easily achieved by setting $v_c = v_w$ and $v_c \cdot v_w = K$ for all $c, w$, where $K$ is a large enough number.

In order to prevent the trivial solution, the objective is extended with $(w,c)$ pairs for which $p(D=1 \mid w,c)$ must be low, i.e. pairs which are not in the data, by generating the set $D'$ of random $(w,c)$ pairs (assuming they are all incorrect), yielding the negative-sampling training objective:

$$\arg\max_{v_w, v_c} \left( \prod_{(w,c) \in D} p(D=1 \mid c,w) \prod_{(w,c) \in D'} p(D=0 \mid c,w) \right)$$

which can be rewritten as:

$$\arg\max_{v_w, v_c} \left( \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w) \right)$$

where $\sigma(x) = 1/(1+e^{-x})$. The objective is trained in an online fashion using stochastic-gradient updates over the corpus $D \cup D'$.
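To make the training procedure concrete, the following is a minimal sketch of a single stochastic-gradient ascent step on this objective, for one observed pair and its sampled negatives. The names (sgns_update, W, C, lr) are illustrative and not taken from the word2vec implementation; it assumes the word and context embeddings are stored as NumPy matrices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W, C, w, pos_c, neg_cs, lr=0.025):
    """One gradient-ascent step on log sigma(v_c.v_w) + sum_j log sigma(-v_cj.v_w).

    W, C   : word and context embedding matrices, shapes (|W|, d) and (|C|, d)
    w      : index of the observed word
    pos_c  : index of the observed context
    neg_cs : indices of the n sampled negative contexts
    """
    v_w = W[w]
    grad_w = np.zeros_like(v_w)

    # Positive pair: pull v_w and v_c together.
    g = 1.0 - sigmoid(np.dot(C[pos_c], v_w))
    grad_w += g * C[pos_c]
    C[pos_c] += lr * g * v_w

    # Negative pairs: push v_w away from the sampled contexts.
    for c in neg_cs:
        g = -sigmoid(np.dot(C[c], v_w))
        grad_w += g * C[c]
        C[c] += lr * g * v_w

    W[w] += lr * grad_w
```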

The negative samples $D'$ can be constructed in various ways. We follow the method proposed by Mikolov et al.: for each $(w,c) \in D$ we construct $n$ samples $(w,c_1), \ldots, (w,c_n)$, where $n$ is a hyperparameter and each $c_j$ is drawn according to its unigram distribution raised to the $3/4$ power.
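A small sketch of how such a noise distribution could be built and sampled, assuming raw unigram counts of the contexts are available; the helper names are hypothetical, and the 3/4 smoothing exponent follows the description above.

```python
import numpy as np

def build_noise_distribution(counts):
    """Unigram context distribution raised to the 3/4 power, renormalized."""
    probs = np.asarray(counts, dtype=float) ** 0.75
    return probs / probs.sum()

def sample_negatives(noise_probs, n, rng=None):
    """Draw n negative context indices for one observed (w, c) pair."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(noise_probs), size=n, p=noise_probs)
```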

Optimizing this objective makes observed word-context pairs have similar embeddings, while scattering unobserved pairs. Intuitively, words that appear in similar contexts should have similar embeddings, though we have not yet found a formal proof that SkipGram does indeed maximize the dot product of similar words.

3 Embedding with Arbitrary Contexts

In the SkipGram embedding algorithm, the contexts of a word w are the words surrounding it in the text. The context vocabulary C is thus identical to the word vocabulary W. However, this restriction is not required by the model; contexts need not correspond to words, and the number of context-types can be substantially larger than the number of word-types. We generalize SkipGram by replacing the bag-of-words contexts with arbitrary contexts.

In this paper we experiment with dependency-based syntactic contexts. Syntactic contexts capture different information than bag-of-word contexts, as we demonstrate using the sentence “Australian scientist discovers star with telescope”.

Linear Bag-of-Words Contexts

This is the context used by word2vec and many other neural embeddings. Using a window of size $k$ around the target word $w$, $2k$ contexts are produced: the $k$ words before and the $k$ words after $w$. For $k=2$, the contexts of the target word $w$ are $w_{-2}, w_{-1}, w_{+1}, w_{+2}$. In our example, the contexts of discovers are Australian, scientist, star, with. (word2vec’s implementation is slightly more complicated: by default the software prunes rare words based on their frequency, and it has an option for sub-sampling frequent words. This pruning and sub-sampling happen before context extraction, leading to a dynamic window size. In addition, the window size is not fixed to $k$ but is sampled uniformly in the range $[1,k]$ for each word.)
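As an illustration, here is a minimal sketch of fixed-window context extraction; it deliberately ignores the pruning, sub-sampling, and dynamic window size mentioned in the note above.

```python
def bow_contexts(tokens, k=2):
    """Extract (word, context) pairs using a fixed window of k tokens per side."""
    pairs = []
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

# The paper's example sentence:
sent = "australian scientist discovers star with telescope".split()
# Contexts of "discovers" with k=2: australian, scientist, star, with
print([c for w, c in bow_contexts(sent, k=2) if w == "discovers"])
```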

Note that a context window of size 2 may miss some important contexts (telescope is not a context of discovers), while including some accidental ones (Australian is a context of discovers). Moreover, the contexts are unmarked, resulting in discovers being a context of both stars and scientist, which may result in stars and scientists ending up as neighbours in the embedded space. A window size of 5 is commonly used to capture broad topical content, whereas smaller windows contain more focused information about the target word.

Dependency-Based Contexts

An alternative to the bag-of-words approach is to derive contexts based on the syntactic relations the word participates in. This is facilitated by recent advances in parsing technology [14, 15] that allow parsing to syntactic dependencies with very high speed and near state-of-the-art accuracy.

After parsing each sentence, we derive word contexts as follows: for a target word $w$ with modifiers $m_1, \ldots, m_k$ and a head $h$, we consider the contexts $(m_1, lbl_1), \ldots, (m_k, lbl_k), (h, lbl_h^{-1})$, where $lbl$ is the type of the dependency relation between the head and the modifier (e.g. nsubj, dobj, prep_with, amod) and $lbl^{-1}$ is used to mark the inverse relation. Relations that include a preposition are “collapsed” prior to context extraction, by directly connecting the head and the object of the preposition, and subsuming the preposition itself into the dependency label. An example of the dependency context extraction is given in Figure 1.

Notice that syntactic dependencies are both more inclusive and more focused than bag-of-words. They capture relations to words that are far apart and thus “out-of-reach” with small window bag-of-words (e.g. the instrument of discover is telescope/prep_with), and also filter out “coincidental” contexts which are within the window but not directly related to the target word (e.g. Australian is not used as the context for discovers). In addition, the contexts are typed, indicating, for example, that stars are objects of discovery and scientists are subjects. We thus expect the syntactic contexts to yield more focused embeddings, capturing more functional and less topical similarity.

Original parse arcs: amod(scientist, Australian), nsubj(discovers, scientist), dobj(discovers, star), prep(discovers, with), pobj(with, telescope).
Collapsed parse arcs: amod(scientist, Australian), nsubj(discovers, scientist), dobj(discovers, star), prep_with(discovers, telescope).

word         contexts
australian   scientist/amod-1
scientist    australian/amod, discovers/nsubj-1
discovers    scientist/nsubj, star/dobj, telescope/prep_with
star         discovers/dobj-1
telescope    discovers/prep_with-1

Figure 1: Dependency-based context extraction example. Top: preposition relations are collapsed into single arcs, making telescope a direct modifier of discovers. Bottom: the contexts extracted for each word in the sentence.
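A minimal sketch of the extraction procedure described above, assuming the parse is already available as (token, head index, label) triples; the input format is an assumption for illustration, not the output format of the parser we used. Run on the example sentence, it reproduces the context listing in Figure 1, including the collapsing of prep/pobj into a single prep_with relation.

```python
def dependency_contexts(parse):
    """Extract dependency contexts from one parsed sentence.

    `parse` is a list of (token, head_index, label) triples, 0-based,
    with head_index == -1 for the root. Objects of prepositions are
    attached directly to the preposition's head, with label prep_<prep>.
    """
    contexts = []
    for i, (tok, head, lbl) in enumerate(parse):
        if lbl == "pobj":
            # Collapse: attach to the preposition's head instead.
            prep_tok, prep_head, _ = parse[head]
            if prep_head < 0:
                continue
            head, lbl = prep_head, "prep_" + prep_tok
        elif lbl == "prep":
            continue  # the bare preposition itself contributes no context
        if head < 0:
            continue  # the root has no head context
        head_tok = parse[head][0]
        contexts.append((tok, head_tok + "/" + lbl + "-1"))   # inverse relation
        contexts.append((head_tok, tok + "/" + lbl))          # direct relation
    return contexts

parse = [("australian", 1, "amod"), ("scientist", 2, "nsubj"),
         ("discovers", -1, "root"), ("star", 2, "dobj"),
         ("with", 2, "prep"), ("telescope", 4, "pobj")]
for word, ctx in dependency_contexts(parse):
    print(word, ctx)
```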

4 Experiments and Evaluation

We experiment with 3 training conditions: BoW5 (bag-of-words contexts with k=5), BoW2 (same, with k=2) and Deps (dependency-based syntactic contexts). We modified word2vec to support arbitrary contexts, and to output the context embeddings in addition to the word embeddings. For bag-of-words contexts we used the original word2vec implementation, and for syntactic contexts, we used our modified version. The negative-sampling parameter (how many negative contexts to sample for every correct one) was 15.

All embeddings were trained on English Wikipedia. For Deps, the corpus was tagged with parts-of-speech using the Stanford tagger [28] and parsed into labeled Stanford dependencies [11] using an implementation of the parser described in [14]. All tokens were converted to lowercase, and words and contexts that appeared less than 100 times were filtered. This resulted in a vocabulary of about 175,000 words, with over 900,000 distinct syntactic contexts. We report results for 300 dimension embeddings, though similar trends were also observed with 600 dimensions.
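For illustration, one possible way to apply the frequency threshold once the corpus has been reduced to (word, context) pairs; the paper does not commit to this exact procedure, and the function name is hypothetical.

```python
from collections import Counter

def filter_by_count(pairs, min_count=100):
    """Keep only pairs whose word and context each occur at least min_count times."""
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    return [(w, c) for w, c in pairs
            if word_counts[w] >= min_count and ctx_counts[c] >= min_count]
```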

4.1 Qualitative Evaluation

Our first evaluation is qualitative: we manually inspect the 5 most similar words (by cosine similarity) to a given set of target words (Table 1).
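The similarity queries behind Table 1 amount to cosine-based nearest-neighbour search over the word-embedding matrix; a minimal sketch follows, where the names W and vocab are illustrative.

```python
import numpy as np

def most_similar(W, vocab, query, topn=5):
    """Return the topn words with the highest cosine similarity to `query`.

    W     : word-embedding matrix of shape (|V|, d)
    vocab : list of words; row i of W corresponds to vocab[i]
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    q = Wn[vocab.index(query)]
    sims = Wn @ q                                      # cosine similarity to every word
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:topn]
```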

The first target word, Batman, results in similar sets across the different setups. This is the case for many target words. However, other target words show clear differences between embeddings.

In Hogwarts - the school of magic from the fictional Harry Potter series - it is evident that BoW contexts reflect the domain aspect, whereas Deps yield a list of famous schools, capturing the semantic type of the target word. This observation holds for Turing and many other nouns as well; BoW find words that associate with w, while Deps find words that behave like w. (For Turing, Deps generated a list of scientists whose names end with “ing”. This may be a result of occasional POS-tagging errors. Still, the embedding does a remarkable job and retrieves scientists despite the noisy POS tags; the list contains more mathematicians without “ing” further down.) Turney [31] described this distinction as domain similarity versus functional similarity.

The Florida example presents an ontological difference; bag-of-words contexts generate meronyms (counties or cities within Florida), while dependency-based contexts provide cohyponyms (other US states). We observed the same behavior with other geographical locations, particularly with countries (though not all of them).

The next two examples demonstrate that similarities induced from Deps share a syntactic function (adjectives and gerunds), while similarities based on BoW are more diverse. Finally, we observe that while both BoW5 and BoW2 yield topical similarities, the larger window size results in more topicality, as expected.

We also tried using the subsampling option [21] with BoW contexts (not shown). Since word2vec removes the subsampled words from the corpus before creating the window contexts, this option effectively increases the window size, resulting in greater topicality.

Target word     | BoW5              | BoW2              | Deps
batman          | nightwing         | superman          | superman
                | aquaman           | superboy          | superboy
                | catwoman          | aquaman           | supergirl
                | superman          | catwoman          | catwoman
                | manhunter         | batgirl           | aquaman
hogwarts        | dumbledore        | evernight         | sunnydale
                | hallows           | sunnydale         | collinwood
                | half-blood        | garderobe         | calarts
                | malfoy            | blandings         | greendale
                | snape             | collinwood        | millfield
turing          | nondeterministic  | non-deterministic | pauling
                | non-deterministic | finite-state      | hotelling
                | computability     | nondeterministic  | heting
                | deterministic     | buchi             | lessing
                | finite-state      | primality         | hamming
florida         | gainesville       | fla               | texas
                | fla               | alabama           | louisiana
                | jacksonville      | gainesville       | georgia
                | tampa             | tallahassee       | california
                | lauderdale        | texas             | carolina
object-oriented | aspect-oriented   | aspect-oriented   | event-driven
                | smalltalk         | event-driven      | domain-specific
                | event-driven      | objective-c       | rule-based
                | prolog            | dataflow          | data-driven
                | domain-specific   | 4gl               | human-centered
dancing         | singing           | singing           | singing
                | dance             | dance             | rapping
                | dances            | dances            | breakdancing
                | dancers           | breakdancing      | miming
                | tap-dancing       | clowning          | busking
Table 1: Target words and their 5 most similar words, as induced by different embeddings.

4.2 Quantitative Evaluation

We supplement the examples in Table 1 with quantitative evaluation to show that the qualitative differences pointed out in the previous section are indeed widespread. To that end, we use the WordSim353 dataset [12, 1]. This dataset contains pairs of similar words that reflect either relatedness (topical similarity) or similarity (functional similarity) relations. (Some word pairs are judged to exhibit both types of similarity, and were ignored in this experiment.) We use the embeddings in a retrieval/ranking setup, where the task is to rank the similar pairs in the dataset above the related ones.

The pairs are ranked according to the cosine similarities between the embedded words. We then draw a recall-precision curve that describes the embedding’s affinity towards one subset (“similarity”) over the other (“relatedness”). We expect Deps’s curve to be higher than BoW2’s curve, which in turn is expected to be higher than BoW5’s. The graph in Figure 2a shows this is indeed the case. We repeated the experiment with a different dataset [7] that was used by Turney [31] to distinguish between domain and functional similarities. The results show a similar trend (Figure 2b). When reversing the task such that the goal is to rank the related terms above the similar ones, the results are reversed, as expected (not shown). (Additional experiments, not presented in this paper, reinforce our conclusion. In particular, we found that Deps perform dramatically worse than BoW contexts on analogy tasks as in [22, 17].)
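The ranking experiment reduces to scoring each pair by cosine similarity and sweeping a threshold down the ranked list; a minimal sketch of how the recall-precision curve could be computed, with the input format assumed for illustration:

```python
import numpy as np

def recall_precision_curve(scored_pairs):
    """scored_pairs: list of (cosine_score, is_similar), where is_similar is
    True for 'similarity' pairs and False for 'relatedness' pairs.
    Returns (recall, precision) arrays over the rank positions."""
    ranked = sorted(scored_pairs, key=lambda x: -x[0])
    labels = np.array([1 if sim else 0 for _, sim in ranked])
    tp = np.cumsum(labels)                     # similar pairs retrieved so far
    ranks = np.arange(1, len(labels) + 1)
    precision = tp / ranks
    recall = tp / labels.sum()
    return recall, precision
```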

Figure 2: Recall-precision curve when attempting to rank the similar words above the related ones. (a) is based on the WordSim353 dataset, and (b) on the Chiarello et al. dataset.

5 Model Introspection

Neural word embeddings are often considered opaque and uninterpretable, unlike sparse vector space representations in which each dimension corresponds to a particular known context, or LDA models where dimensions correspond to latent topics. While this is true to a large extent, we observe that SkipGram does allow a non-trivial amount of introspection. Although we cannot assign a meaning to any particular dimension, we can indeed get a glimpse at the kind of information being captured by the model, by examining which contexts are “activated” by a target word.

Recall that the learning procedure is attempting to maximize the dot product $v_c \cdot v_w$ for good $(w,c)$ pairs and minimize it for bad ones. If we keep the context embeddings, we can query the model for the contexts that are most activated by (have the highest dot product with) a given target word. By doing so, we can see what the model learned to be a good discriminative context for the word.
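Concretely, if both embedding matrices are retained after training, this query is a single matrix-vector product followed by a top-k selection; a minimal sketch with illustrative names:

```python
import numpy as np

def top_contexts(W, C, word_vocab, ctx_vocab, word, topn=5):
    """Contexts c maximizing the dot product v_c . v_w for a target word.

    W, C : word and context embedding matrices kept from training
    """
    v_w = W[word_vocab.index(word)]
    scores = C @ v_w                       # activation of every context by the word
    best = np.argsort(-scores)[:topn]
    return [(ctx_vocab[i], float(scores[i])) for i in best]
```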

To demonstrate, we list the 5 most activated contexts for our example words with Deps embeddings in Table 2. Interestingly, the most discriminative syntactic contexts in these cases are not associated with subjects or objects of verbs (or their inverse), but rather with conjunctions, appositions, noun-compounds and adjectival modifiers. Additionally, the collapsed preposition relation is very useful (e.g. for capturing the school aspect of hogwarts). The presence of many conjunction contexts, such as superman/conj for batman and singing/conj for dancing, may explain the functional similarity observed in Section 4; conjunctions in natural language tend to enforce their conjuncts to share the same semantic types and inflections.

In the future, we hope that insights from such model introspection will allow us to develop better contexts, by focusing on conjunctions and prepositions for example, or by trying to figure out why the subject and object relations are absent and finding ways of increasing their contributions.

batman               | hogwarts              | turing
superman/conj-1      | students/prep_at-1    | machine/nn-1
spider-man/conj-1    | educated/prep_at-1    | test/nn-1
superman/conj        | student/prep_at-1     | theorem/poss-1
spider-man/conj      | stay/prep_at-1        | machines/nn-1
robin/conj           | learned/prep_at-1     | tests/nn-1
florida              | object-oriented       | dancing
marlins/nn-1         | programming/amod-1    | dancing/conj
beach/appos-1        | language/amod-1       | dancing/conj-1
jacksonville/appos-1 | framework/amod-1      | singing/conj-1
tampa/appos-1        | interface/amod-1      | singing/conj
florida/conj-1       | software/amod-1       | ballroom/nn
Table 2: Words and their top syntactic contexts.

6 Conclusions

We presented a generalization of the SkipGram embedding model in which the linear bag-of-words contexts are replaced with arbitrary ones, and experimented with dependency-based contexts, showing that they produce markedly different kinds of similarities. These results are expected, and follow similar findings in the distributional semantics literature. We also demonstrated how the resulting embedding model can be queried for the discriminative contexts of a given word, and observed that the learning procedure seems to favor relatively local syntactic contexts, as well as conjunctions and objects of prepositions. We hope these insights will facilitate further research into improved context modeling and better, possibly task-specific, embedded representations. Our software, which allows experimentation with arbitrary contexts, together with the embeddings described in this paper, is available for download at the authors’ websites.

References

  • [1] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca and A. Soroa (2009). A study on similarity and relatedness using distributional and wordnet-based approaches. Boulder, Colorado, pp. 19–27.
  • [2] R. Al-Rfou, B. Perozzi and S. Skiena (2013). Polyglot: distributed word representations for multilingual nlp.
  • [3] M. Baroni and A. Lenci (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics 36 (4), pp. 673–721.
  • [4] Y. Bengio, R. Ducharme, P. Vincent and C. Jauvin (2003). A neural probabilistic language model. Journal of Machine Learning Research 3, pp. 1137–1155.
  • [5] P. F. Brown, R. L. Mercer, V. J. Della Pietra and J. C. Lai (1992). Class-based n-gram models of natural language. Computational Linguistics 18 (4).
  • [6] J. A. Bullinaria and J. P. Levy (2007). Extracting semantic representations from word co-occurrence statistics: a computational study. Behavior Research Methods 39 (3), pp. 510–526.
  • [7] C. Chiarello, C. Burgess, L. Richards and A. Pollock (1990). Semantic and associative priming in the cerebral hemispheres: some words do, some words don’t… sometimes, some places. Brain and Language 38 (1), pp. 75–104.
  • [8] R. Cohen, Y. Goldberg and M. Elhadad (2012). Domain adaptation of a dependency parser with a class-class selectional preference model. Jeju Island, Korea, pp. 43–48.
  • [9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa (2011). Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12, pp. 2493–2537.
  • [10] R. Collobert and J. Weston (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. pp. 160–167.
  • [11] M. de Marneffe and C. D. Manning (2008). The Stanford typed dependencies representation. Manchester, UK, pp. 1–8.
  • [12] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman and E. Ruppin (2002). Placing search in context: the concept revisited. ACM Transactions on Information Systems 20 (1), pp. 116–131.
  • [13] Y. Goldberg and O. Levy (2014). word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
  • [14] Y. Goldberg and J. Nivre (2012). A dynamic oracle for the arc-eager system.
  • [15] Y. Goldberg and J. Nivre (2013). Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics 1.
  • [16] Z. Harris (1954). Distributional structure. Word 10 (23), pp. 146–162.
  • [17] O. Levy and Y. Goldberg (2014). Linguistic regularities in sparse and explicit word representations. Baltimore, Maryland, USA.
  • [18] D. Lin (1998). Automatic retrieval and clustering of similar words. ACL ’98, Stroudsburg, PA, USA, pp. 768–774.
  • [19] T. Mikolov, K. Chen, G. Corrado and J. Dean (2013). Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
  • [20] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky and S. Khudanpur (2011). Extensions of recurrent neural network language model. pp. 5528–5531.
  • [21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean (2013). Distributed representations of words and phrases and their compositionality. pp. 3111–3119.
  • [22] T. Mikolov, W. Yih and G. Zweig (2013). Linguistic regularities in continuous space word representations. Atlanta, Georgia, pp. 746–751.
  • [23] A. Mnih and G. E. Hinton (2008). A scalable hierarchical distributed language model. pp. 1081–1088.
  • [24] S. Padó and M. Lapata (2007). Dependency-based construction of semantic space models. Computational Linguistics 33 (2), pp. 161–199.
  • [25] A. Ritter, Mausam and O. Etzioni (2010). A latent dirichlet allocation method for selectional preferences. pp. 424–434.
  • [26] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng and C. D. Manning (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. pp. 151–161.
  • [27] D. Ó. Séaghdha (2010). Latent variable models of selectional preference. pp. 435–444.
  • [28] K. Toutanova, D. Klein, C. Manning and Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network.
  • [29] J. Turian, L. Ratinov and Y. Bengio (2010). Word representations: a simple and general method for semi-supervised learning. pp. 384–394.
  • [30] P. D. Turney and P. Pantel (2010). From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37 (1), pp. 141–188.
  • [31] P. D. Turney (2012). Domain and function: a dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research 44, pp. 533–585.
  • [32] J. Uszkoreit and T. Brants (2008). Distributed word clustering for large scale class-based language modeling in machine translation. pp. 755–762.