Following up on recent work on establishing a mapping between vector-based semantic embeddings of words and the visual representations of the corresponding objects from natural images, we first present a simple approach to cross-modal vector-based semantics for the task of zero-shot learning, in which an image of a previously unseen object is mapped to a linguistic representation denoting its word. We then introduce fast mapping, a challenging and more cognitively plausible variant of the zero-shot task, in which the learner is exposed to new objects and the corresponding words in very limited linguistic contexts. By combining prior linguistic and visual knowledge acquired about words and their objects, as well as exploiting the limited new evidence available, the learner must learn to associate new objects with words. Our results on this task pave the way to realistic simulations of how children or robots could use existing knowledge to bootstrap grounded semantic knowledge about new concepts.
Computational models of meaning that rely on corpus-extracted context vectors, such as LSA [31], HAL [36], Topic Models [20] and more recent neural-network approaches [11, 38] have successfully tackled a number of lexical semantics tasks, where context vector similarity highly correlates with various indices of semantic relatedness [53]. Given that these models are learned from naturally occurring data using simple associative techniques, various authors have advanced the claim that they might be also capturing some crucial aspects of how humans acquire and use language [31, 33].
However, the models induce the meaning of words entirely from their co-occurrence with other words, without links to the external world. This constitutes a serious blow to claims of cognitive plausibility in at least two respects. One is the grounding problem [24, 43]. Irrespective of their relatively high performance on various semantic tasks, it is debatable whether models that have no access to visual and perceptual information can capture the holistic, grounded knowledge that humans have about concepts. However, a possibly even more serious pitfall of vector models is lack of reference: natural language is, fundamentally, a means to communicate, and thus our words must be able to refer to objects, properties and events in the outside world [1]. Current vector models are purely language-internal, solipsistic models of meaning. Consider the very simple scenario in which visual information is being provided to an agent about the current state of the world, and the agent’s task is to determine the truth of a statement similar to There is a dog in the room. Although the agent is equipped with a powerful context vector model, this will not suffice to successfully complete the task. The model might suggest that the concepts of dog and cat are semantically related, but it has no means to determine the visual appearance of dogs, and consequently no way to verify the truth of such a simple statement.
Mapping words to the objects they denote is such a core function of language that humans are highly optimized for it, as shown by the so-called fast mapping phenomenon, whereby children can learn to associate a word with an object or property after a single exposure to it [2, 8, 7, 25]. But lack of reference is not only a theoretical weakness: Without the ability to refer to the outside world, context vectors are arguably useless for practical goals such as learning to execute natural language instructions [3, 10], applications that could greatly benefit from the rich network of lexical meaning such vectors encode in order to scale up to real-life challenges.
Very recently, a number of papers have exploited advances in automated feature extraction from images and videos to enrich context vectors with visual information [5, 16, 34, 42, 44]. This line of research tackles the grounding problem: Word representations are no longer limited to their linguistic contexts but also encode visual information present in images associated with the corresponding objects. In this paper, we rely on the same image analysis techniques but instead focus on the reference problem: We do not aim at enriching word representations with visual information, although this might be a side effect of our approach, but we address the issue of automatically mapping objects, as depicted in images, to the context vectors representing the corresponding words. This is achieved by means of a simple neural network trained to project image-extracted feature vectors to text-based vectors through a hidden layer that can be interpreted as a cross-modal semantic space.
We first test the effectiveness of our cross-modal semantic space on the so-called zero-shot learning task [40], which has recently been explored in the machine learning community [18, 49]. In this setting, we assume that our system possesses linguistic and visual information for a set of concepts in the form of text-based representations of words and image-based vectors of the corresponding objects, used for vision-to-language-mapping training. The system is then provided with visual information for a previously unseen object, and the task is to associate it with a word by cross-modal mapping. Our approach is competitive with respect to the recently proposed alternatives, while being overall simpler.
The aforementioned task is very demanding and interesting from an engineering point of view. However, from a cognitive angle, it relies on strong, unrealistic assumptions: The learner is asked to establish a link between a new object and a word for which they possess a full-fledged text-based vector extracted from a billion-word corpus. In reality, the first time a learner is exposed to a new object, the linguistic information available is likely to be equally limited. Thus, in order to consider vision-to-language mapping under more plausible conditions, similar to the ones that children or robots in a new environment are faced with, we next simulate a scenario akin to fast mapping. We show that the induced cross-modal semantic space is powerful enough that sensible guesses about the correct word denoting an object can be made, even when the linguistic context vector representing the word has been created from as little as a single sentence containing it.
The contributions of this work are three-fold. First, we conduct experiments with simple image- and text-based vector representations and compare alternative methods to perform cross-modal mapping. Then, we complement recent work [18] and show that zero-shot learning scales to a large and noisy dataset. Finally, we provide preliminary evidence that cross-modal projections can be used effectively to simulate a fast mapping scenario, thus strengthening the claims of this approach as a full-fledged, fully inductive theory of meaning acquisition.
The problem of establishing word reference has been extensively explored in computational simulations of cross-situational learning (see Fazly et al. (2010) for a recent proposal and extended review of previous work). This line of research has traditionally assumed artificial models of the external world, typically a set of linguistic or logical labels for objects, actions and possibly other aspects of a scene [46]. Recently, Yu and Siskind (2013) presented a system that induces word-object mappings from features extracted from short videos paired with sentences. Our work complements theirs in two ways. First, unlike Yu and Siskind (2013) who considered a limited lexicon of 15 items with only 4 nouns, we conduct experiments in a large search space containing a highly ambiguous set of potential target words for every object (see Section 4.1). Most importantly, by projecting visual representations of objects into a shared semantic space, we do not limit ourselves to establishing a link between objects and words. We induce a rich semantic representation of the multimodal concept, that can lead, among other things, to the discovery of important properties of an object even when we lack its linguistic label. Nevertheless, Yu and Siskind’s system could in principle be used to initialize the vision-language mapping that we rely upon.
Closer to the spirit of our work are two very recent studies coming from the machine learning community. Socher et al. (2013) and Frome et al. (2013) focus on zero-shot learning in the vision-language domain by exploiting a shared visual-linguistic semantic space. Socher et al. (2013) learn to project unsupervised vector-based image representations onto a word-based semantic space using a neural network architecture. Unlike us, Socher and colleagues train an outlier detector to decide whether a test image should receive a known-word label by means of a standard supervised object classifier, or be assigned an unseen label by vision-to-language mapping. In our zero-shot experiments, we assume no access to an outlier detector, and thus, the search for the correct label is performed in the full concept space. Furthermore, Socher and colleagues present a much more constrained evaluation setup, where only 10 concepts are considered, compared to our experiments with hundreds or thousands of concepts.
Frome et al. (2013) use linear regression to transform vector-based image representations onto vectors representing the same concepts in linguistic semantic space. Unlike Socher et al. (2013) and the current study, which adopt simple unsupervised techniques for constructing image representations, Frome et al. (2013) rely on a supervised state-of-the-art method: They feed low-level features to a deep neural network trained on a supervised object recognition task [29]. Furthermore, their text-based vectors, induced with the neural language model of [39], encode very rich information. A natural question we aim to answer is whether the success of cross-modal mapping is due to the high-quality embeddings or to the general algorithmic design. If the latter is the case, then these results could be extended to traditional distributional vectors bearing other desirable properties, such as high interpretability of dimensions.
“We found a cute, hairy wampimuk sleeping behind the tree.” Even though the previous statement is certainly the first time one hears about wampimuks, the linguistic context already creates some visual expectations: Wampimuks probably resemble small animals (Figure 1). This is the scenario of zero-shot learning. Moreover, if this is also the first linguistic encounter of that concept, then we refer to the task as fast mapping.
Concretely, we assume that concepts, denoted for convenience by word labels, are represented in linguistic terms by vectors in a text-based distributional semantic space (see Section 4.3). Objects corresponding to concepts are represented in visual terms by vectors in an image-based semantic space (Section 4.2). For a subset of concepts (e.g., a set of animals, a set of vehicles), we possess information related to both their linguistic and visual representations. During training, this cross-modal vocabulary is used to induce a projection function (Section 4.4), which – intuitively – represents a mapping between visual and linguistic dimensions. Thus, this function, given a visual vector, returns its corresponding linguistic representation. At test time, the system is presented with a previously unseen object (e.g., wampimuk). This object is projected onto the linguistic space and associated with the word label of the nearest neighbor in that space (degus in Figure 1).
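To make this test-time step concrete, the following minimal Python/NumPy sketch labels a new image vector by projecting it into the linguistic space and retrieving the word of its nearest neighbor there. The projection function, the word matrix `W`, the label list and `v_new` are hypothetical placeholders, not part of a released pipeline.

```python
import numpy as np

def cosine_sim(a, B):
    """Cosine similarity between a vector a and every row of matrix B."""
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

def zero_shot_label(v_new, project, W, labels):
    """Project an image-based vector into the linguistic space and
    return the word label of its nearest (cosine) neighbor there."""
    w_hat = project(v_new)        # mapped representation in text space
    sims = cosine_sim(w_hat, W)   # compare with all word vectors
    return labels[int(np.argmax(sims))]

# Hypothetical usage with a linear projection matrix A learned beforehand
# (see Section 4.4):
# label = zero_shot_label(v_new, lambda v: v @ A, W, labels)
```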
The fast mapping setting can be seen as a special case of the zero-shot task. Whereas for the latter our system assumes that all concepts have rich linguistic representations (i.e., representations estimated from a large corpus), in the case of the former, new concepts are assumed to be encountered in a limited linguistic context and therefore to lack rich linguistic representations. This is operationalized by constructing the text-based vector for these concepts from a context of just a few occurrences. In this way, we simulate the first encounter of a learner with a concept that is new in both visual and linguistic terms.
The CIFAR-100 dataset [30] consists of 60,000 32x32 colour images (note the extremely small size) representing 100 distinct concepts, with 600 images per concept. The dataset covers a wide range of concrete domains and is organized into 20 broader categories. Table 1 lists the concepts used in our experiments organized by category.
Category | Seen Concepts | Unseen (Test) Concepts |
---|---|---|
aquatic mammals | beaver, otter, seal, whale | dolphin |
fish | ray, trout | shark |
flowers | orchid, poppy, sunflower, tulip | rose |
food containers | bottle, bowl, can, plate | cup |
fruit and vegetables | apple, mushroom, pear | orange |
household electrical devices | keyboard, lamp, telephone, television | clock |
household furniture | chair, couch, table, wardrobe | bed |
insects | bee, beetle, caterpillar, cockroach | butterfly |
large carnivores | bear, leopard, lion, wolf | tiger |
large man-made outdoor things | bridge, castle, house, road | skyscraper |
large natural outdoor scenes | cloud, mountain, plain, sea | forest |
large omnivores and herbivores | camel, cattle, chimpanzee, kangaroo | elephant |
medium-sized mammals | fox, porcupine, possum, skunk | raccoon |
non-insect invertebrates | crab, snail, spider, worm | lobster |
people | baby, girl, man, woman | boy |
reptiles | crocodile, dinosaur, snake, turtle | lizard |
small mammals | hamster, mouse, rabbit, shrew | squirrel |
vehicles 1 | bicycle, motorcycle, train | bus |
vehicles 2 | rocket, tank, tractor | streetcar |
Our second dataset consists of 100K images from the ESP-Game dataset, labeled through a “game with a purpose” [55] (http://www.cs.cmu.edu/~biglou/resources/). The ESP image tags form a vocabulary of 20,515 unique words. Unlike other datasets used for zero-shot learning, it covers adjectives and verbs in addition to nouns. On average, an image has 14 tags and a word appears as a tag for 70 images. Unlike the CIFAR-100 images, which were chosen specifically for image object recognition tasks (i.e., each image clearly depicts a single object in the foreground), ESP contains a random selection of images from the Web. Consequently, objects do not appear in most images in their prototypical display, but rather as elements of complex scenes (see Figure 2). Thus, ESP constitutes a more realistic, and at the same time more challenging, simulation of how things are encountered in real life, testing the potential of cross-modal mapping to deal with the complex scenes that one would encounter in event recognition and caption generation tasks.
Image-based vectors are extracted using the unsupervised bag-of-visual-words (BoVW) representational architecture [47, 12], that has been widely and successfully applied to computer vision tasks such as object recognition and image retrieval [56]. First, low-level visual features [52] are extracted from a large collection of images and clustered into a set of “visual words”. The low-level features of a specific image are then mapped to the corresponding visual words, and the image is represented by a count vector recording the number of occurrences of each visual word in it. We do not attempt any parameter tuning of the pipeline.
As low-level features, we use Scale Invariant Feature Transform (SIFT) features [35]. SIFT features are tailored to capture object parts and to be invariant to several image transformations such as rotation, illumination and scale change. These features are clustered into vocabularies of 5,000 (ESP) and 4,096 (CIFAR-100) visual words (for the vocabulary size, we relied on standard settings found in the relevant literature [5, 9]). To preserve spatial information in the BoVW representation, we use the spatial pyramid technique [32], which consists in dividing the image into several regions, computing BoVW vectors for each region and concatenating them. In particular, we divide ESP images into 16 regions and the smaller CIFAR-100 images into 4. The vectors resulting from region concatenation have dimensionality 80,000 (ESP) and 16,384 (CIFAR-100), respectively. We apply Local Mutual Information (LMI, [13]) as a weighting scheme and reduce the full co-occurrence space to 300 dimensions using Singular Value Decomposition.
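For readers who want to reproduce the general flavour of this visual pipeline, here is a minimal sketch in Python with scikit-learn. It assumes SIFT descriptors have already been extracted per image region with an external tool; the vocabulary size, region count and the simplified LMI weighting shown here are illustrative, not an exact re-implementation of our setup.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD

N_VISUAL_WORDS = 5000   # ESP setting; 4096 for CIFAR-100
N_REGIONS = 16          # spatial pyramid: 16 regions for ESP, 4 for CIFAR-100

def build_vocabulary(all_descriptors, k=N_VISUAL_WORDS):
    """Cluster SIFT descriptors (n x 128 array) pooled from a large image
    collection into k 'visual words'."""
    km = MiniBatchKMeans(n_clusters=k, random_state=0)
    km.fit(all_descriptors)
    return km

def bovw_vector(region_descriptors, vocab):
    """Represent one image: assign each region's descriptors to visual words,
    count them, and concatenate the per-region histograms (spatial pyramid)."""
    hists = []
    for descs in region_descriptors:    # one descriptor array per region
        words = vocab.predict(descs)
        hists.append(np.bincount(words, minlength=vocab.n_clusters))
    return np.concatenate(hists)        # dimensionality = k * n_regions

def lmi(M):
    """Simplified Local Mutual Information weighting: count * PMI."""
    total = M.sum()
    rows = M.sum(axis=1, keepdims=True)
    cols = M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((M * total) / (rows * cols))
    pmi[~np.isfinite(pmi)] = 0.0
    return M * pmi

def weight_and_reduce(M, dim=300):
    """LMI weighting followed by SVD reduction to 300 dimensions."""
    return TruncatedSVD(n_components=dim, random_state=0).fit_transform(lmi(M))
```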
For CIFAR-100, we extract distinct visual vectors for single images. For ESP, given the size and amount of noise in this dataset, we build vectors for visual concepts, by normalizing and summing the BoVW vectors of all the images that have the relevant concept as a tag. Note that the relevant literature [41] has emphasized the importance of learners self-generating multiple views when faced with new objects. Thus, our multiple-image assumption should not be considered problematic in the current setup.
We implement the entire visual pipeline with VSEM, an open library for visual semantics [4] (http://clic.cimec.unitn.it/vsem/).
For constructing the text-based vectors, we follow a standard pipeline in distributional semantics [53] without tuning its parameters and collect co-occurrence statistics from the concatenation of ukWaC (http://wacky.sslmit.unibo.it) and Wikipedia, amounting to 2.7 billion tokens in total. Semantic vectors are constructed for a set of 30K target words (lemmas), namely the top 20K most frequent nouns, 5K most frequent adjectives and 5K most frequent verbs, and the same 30K lemmas are also employed as contextual elements. We collect co-occurrences in a symmetric context window of 20 elements around a target word. Finally, similarly to the visual semantic space, raw counts are transformed by applying LMI and then reduced to 300 dimensions with SVD. (We also experimented with the image- and text-based vectors of Socher et al. (2013), but achieved better performance with the reported setup.)
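A minimal sketch of the corresponding textual pipeline is given below. The corpus iterator and the 30K-lemma vocabulary are placeholders; in practice a sparse matrix would be used, and the same LMI/SVD helpers sketched for the visual space apply here.

```python
import numpy as np
from collections import defaultdict

WINDOW = 20  # symmetric context window around each target lemma

def cooccurrence_matrix(sentences, targets):
    """Count target-context co-occurrences within a +/- WINDOW window.
    `sentences` is an iterable of lemma lists; `targets` is the 30K-lemma
    vocabulary used both as targets and as contextual elements."""
    idx = {w: i for i, w in enumerate(targets)}
    counts = defaultdict(float)
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in idx:
                continue
            lo, hi = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in idx:
                    counts[(idx[w], idx[sent[j]])] += 1.0
    M = np.zeros((len(targets), len(targets)))
    for (r, c), v in counts.items():
        M[r, c] = v
    return M

# Raw counts are then reweighted with LMI and reduced to 300 dimensions with
# SVD, exactly as for the visual space; with the helpers sketched above:
# W = weight_and_reduce(cooccurrence_matrix(corpus_sentences, target_lemmas))
```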
The process of learning to map objects to their word labels is implemented by training a projection function $f_{proj}$ from the visual onto the linguistic semantic space. For the learning, we use a set of $N_s$ seen concepts for which we have both image-based visual representations $\mathbf{V}_s \in \mathbb{R}^{N_s \times d_v}$ and text-based linguistic representations $\mathbf{W}_s \in \mathbb{R}^{N_s \times d_w}$. The projection function is subject to an objective that aims at minimizing some cost function between the induced text-based representations $\hat{\mathbf{W}}_s$ and the gold ones $\mathbf{W}_s$. The induced $\hat{f}_{proj}$ is then applied to the image-based representations $\mathbf{V}_u$ of unseen objects to transform them into text-based representations $\hat{\mathbf{W}}_u$. We implement 4 alternative learning algorithms for inducing the cross-modal projection function $\hat{f}_{proj}$.
Our first model is a very simple linear mapping between the two modalities estimated by solving a least-squares problem. This method is similar to the one introduced by Mikolov et al. (2013a) for estimating a translation matrix, only solved analytically. In our setup, we can see the two different modalities as if they were different languages. By using least-squares regression, the projection function can be derived as
$$\hat{f}_{proj} = (\mathbf{V}_s^{T}\mathbf{V}_s)^{-1}\,\mathbf{V}_s^{T}\mathbf{W}_s \qquad (1)$$
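A minimal NumPy sketch of the lin model follows, assuming row-aligned seen matrices `Vs` and `Ws`; `np.linalg.lstsq` computes the analytical least-squares solution of Equation (1).

```python
import numpy as np

def fit_linear_projection(Vs, Ws):
    """Closed-form least-squares solution of Equation (1):
    find A minimizing ||Vs A - Ws||^2."""
    A, *_ = np.linalg.lstsq(Vs, Ws, rcond=None)
    return A                      # shape: d_v x d_w

def project_linear(V_unseen, A):
    """Map image-based vectors into the linguistic space."""
    return V_unseen @ A
```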
CCA (Canonical Correlation Analysis) [22, 27] and variations thereof have been successfully used in the past for annotation of regions [48] and complete images [21, 26]. Given two paired observation matrices, in our case $\mathbf{V}_s$ and $\mathbf{W}_s$, CCA aims at capturing the linear relationship that exists between these variables. This is achieved by finding a pair of matrices, in our case $\mathbf{C}_V$ and $\mathbf{C}_W$, such that the correlation between the projections of the two multidimensional variables into a common, lower-rank space is maximized. The resulting multimodal space has been shown to provide a good approximation to human concept similarity judgments [45]. In our setup, after applying CCA on the two spaces $\mathbf{V}_s$ and $\mathbf{W}_s$, we obtain the two projection mappings onto the common space and thus our projection function can be derived as:

$$\hat{f}_{proj} = \mathbf{C}_V\,\mathbf{C}_W^{+} \qquad (2)$$

where $\mathbf{C}_W^{+}$ denotes the pseudoinverse of $\mathbf{C}_W$.
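Below is a minimal sketch using scikit-learn's CCA, assuming the same row-aligned matrices. For simplicity it compares image and word vectors directly in the learned common space rather than explicitly composing the two projection matrices as in Equation (2); the number of components is illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def fit_cca(Vs, Ws, n_components=128):
    """Fit CCA on the paired seen matrices (n_components is illustrative)."""
    cca = CCA(n_components=n_components, max_iter=1000)
    cca.fit(Vs, Ws)
    return cca

def label_via_cca(cca, v_new, W_candidates, labels):
    """Project the new image vector and all candidate word vectors into the
    common CCA space and return the label of the most similar word."""
    v_c, W_c = cca.transform(v_new.reshape(1, -1), W_candidates)
    v_c = v_c / np.linalg.norm(v_c)
    W_c = W_c / np.linalg.norm(W_c, axis=1, keepdims=True)
    return labels[int(np.argmax(W_c @ v_c.ravel()))]
```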
SVD is the most widely used dimensionality reduction technique in distributional semantics [53], and it has recently been exploited to combine visual and linguistic dimensions in the multimodal distributional semantic model of Bruni et al. (2014). SVD smoothing is also a way to infer values of unseen dimensions in partially incomplete matrices, a technique that has been applied to the task of inferring word tags of unannotated images [23]. Assuming that the concept-representing rows of $\mathbf{V}_s$ and $\mathbf{W}_s$ are ordered in the same way, we apply the ($k$-truncated) SVD to the concatenated matrix $[\mathbf{V}_s\ \mathbf{W}_s]$, such that $[\hat{\mathbf{V}}_s\ \hat{\mathbf{W}}_s] = \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{Z}_k^{T}$ is a $k$-rank approximation of the original matrix (we denote the right singular vectors matrix by $\mathbf{Z}$ instead of the customary $\mathbf{V}$ to avoid confusion with the visual matrix). The projection function is then:

$$\hat{f}_{proj}(\mathbf{V}_u) = [\mathbf{V}_u\ \mathbf{0}]\,\mathbf{Z}_k\mathbf{Z}_k^{T} \qquad (3)$$

where the input is appropriately padded with 0s ($[\mathbf{V}_u\ \mathbf{0}]$) and we discard the visual block $\hat{\mathbf{V}}_u$ of the output matrix $[\hat{\mathbf{V}}_u\ \hat{\mathbf{W}}_u]$.
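A minimal NumPy sketch of this SVD-based projection, assuming row-aligned `Vs` and `Ws` and an illustrative `k`:

```python
import numpy as np

def fit_svd_projection(Vs, Ws, k=300):
    """Truncated SVD of the concatenated matrix [Vs Ws]; returns the
    top-k right singular vectors Z_k (Equation 3)."""
    _, _, Zt = np.linalg.svd(np.hstack([Vs, Ws]), full_matrices=False)
    return Zt[:k].T                               # shape: (d_v + d_w) x k

def project_svd(V_unseen, Zk, d_w):
    """Pad unseen visual vectors with zeros for the text block, project onto
    the k-dimensional space and back, and keep only the text block."""
    padded = np.hstack([V_unseen, np.zeros((V_unseen.shape[0], d_w))])
    reconstructed = padded @ Zk @ Zk.T
    return reconstructed[:, -d_w:]                # induced text-based vectors
```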
The last model that we introduce is a neural network with one hidden layer. The projection function in this model can be described as:
$$\hat{f}_{proj}(\mathbf{V}_s) = \sigma_2\!\left(\sigma_1\!\left(\mathbf{V}_s\mathbf{\Theta}^{(1)}\right)\mathbf{\Theta}^{(2)}\right) \qquad (4)$$

where $\mathbf{\Theta}$ consists of the model weights $\mathbf{\Theta}^{(1)}$ and $\mathbf{\Theta}^{(2)}$ that map the input image-based vectors first to the hidden layer and then to the output layer in order to obtain text-based vectors, i.e., $\hat{\mathbf{W}}_s = \hat{f}_{proj}(\mathbf{V}_s)$, and $\sigma_1$ and $\sigma_2$ are the non-linear activation functions. We experimented with sigmoid, hyperbolic tangent and linear activations; hyperbolic tangent yielded the highest performance. The weights are estimated by minimizing the objective function
$$J(\mathbf{\Theta}) = \sum_{i=1}^{N_s}\Big(1 - sim\big(\mathbf{w}^{(i)},\, \hat{f}_{proj}(\mathbf{v}^{(i)})\big)\Big) \qquad (5)$$

where $sim$ is some similarity function. In our experiments we used the cosine as similarity function, so that $sim(\mathbf{w}, \hat{\mathbf{w}}) = \frac{\mathbf{w}\cdot\hat{\mathbf{w}}}{\|\mathbf{w}\|\,\|\hat{\mathbf{w}}\|}$, thus penalizing parameter settings leading to a low cosine between the target linguistic representations $\mathbf{w}^{(i)}$ and those produced by the projection function, $\hat{f}_{proj}(\mathbf{v}^{(i)})$. The cosine has been widely used in the distributional semantic literature, and it has been shown to outperform Euclidean distance [6]. (We also experimented with the same objective function as Socher et al. (2013); however, our objective function yielded consistently better results in all experimental settings.) Parameters were estimated with standard backpropagation and L-BFGS.
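The following PyTorch sketch illustrates the architecture of Equation (4) and the cosine-based objective of Equation (5). The hidden size, the number of optimization steps, and the use of PyTorch's LBFGS optimizer are illustrative stand-ins for the backpropagation + L-BFGS training described above, not our exact implementation.

```python
import torch
import torch.nn as nn

class CrossModalNN(nn.Module):
    """One hidden layer mapping image-based vectors to text-based vectors,
    with tanh activations (Equation 4)."""
    def __init__(self, d_v, d_w, d_hidden=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_v, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_w), nn.Tanh(),
        )

    def forward(self, v):
        return self.net(v)

def train(model, Vs, Ws, steps=50):
    """Minimize 1 - cosine(target, prediction), summed over seen concepts
    (Equation 5), using the LBFGS optimizer."""
    opt = torch.optim.LBFGS(model.parameters(), max_iter=steps)
    def closure():
        opt.zero_grad()
        loss = (1 - nn.functional.cosine_similarity(model(Vs), Ws, dim=1)).sum()
        loss.backward()
        return loss
    opt.step(closure)
    return model
```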
Our experiments focus on the tasks of zero-shot learning (Sections 5.1 and 5.2) and fast mapping (Section 5.3). In both tasks, the projected vector of the unseen concept is labeled with the word associated with its cosine-based nearest neighbor vector in the corresponding semantic space.
For the zero-shot task we report the accuracy of retrieving the correct label among the top $k$ neighbors from a semantic space populated with the union of seen and unseen concepts. For fast mapping, we report the mean rank of the correct concept among the fast mapping candidates.
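Concretely, both measures can be computed from the rank of the gold word among the cosine neighbors of each projected vector, as in this minimal sketch (variable names are illustrative):

```python
import numpy as np

def rank_of_gold(w_hat, W, labels, gold):
    """Rank of the gold word among the cosine neighbors of a projected vector."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = Wn @ (w_hat / np.linalg.norm(w_hat))
    order = np.argsort(-sims)
    return int(np.where(np.asarray(labels)[order] == gold)[0][0]) + 1

def hit_at_k(ranks, k):
    """Zero-shot accuracy: fraction of test items whose correct label is
    retrieved among the top k neighbors."""
    return float(np.mean(np.asarray(ranks) <= k))

def mean_rank(ranks):
    """Mean rank of the correct concept (the fast mapping measure)."""
    return float(np.mean(ranks))
```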
For this experiment, we use the intersection of our linguistic space with the concepts present in CIFAR-100, containing a total of 90 concepts. For each concept category, we treat all concepts but one as seen concepts (Table 1). The 71 seen concepts correspond to 42,600 distinct visual vectors and are used to induce the projection function. Table 2 reports results obtained by averaging the performance on the 11,400 distinct vectors of the 19 unseen concepts.
Our 4 models introduced in Section 4.4 are compared to Chance, a theoretically derived baseline that simulates selecting a label at random. For the neural network NN, we use prior knowledge about the number of concept categories to set the number of hidden units to 20, in order to avoid tuning this parameter. For the SVD model, we set the number of dimensions to 300, a common choice in distributional semantics, coherent with the settings we used for the visual and linguistic spaces.
First and foremost, all 4 models outperform Chance by a large margin. Surprisingly, the very simple lin method outperforms both CCA and SVD. However, NN, an architecture that can capture more complex, non-linear relations in features across modalities, emerges as the best performing model, confirming on a larger scale the recent findings of Socher et al. (2013).
Model \ k | 1 | 2 | 3 | 5 | 10 | 20 |
---|---|---|---|---|---|---|
Chance | 1.1 | 2.2 | 3.3 | 5.5 | 11.0 | 22.0 |
SVD | 1.9 | 5.0 | 8.1 | 14.5 | 29.0 | 48.6 |
CCA | 3.0 | 6.9 | 10.7 | 17.9 | 31.7 | 51.7 |
lin | 2.4 | 6.4 | 10.5 | 18.7 | 33.0 | 55.0 |
NN | 3.9 | 6.6 | 10.6 | 21.9 | 37.9 | 58.2 |
In order to gain qualitative insights into the performance of the projection process of NN, we attempt to investigate the role and interpretability of the hidden layer. We achieve this by looking at which visual concepts result in the highest hidden unit activation. (For this post-hoc analysis, we include a sparsity parameter in the objective function of Equation 5 in order to get more interpretable results; hidden units are therefore maximally activated by only a few concepts.) This is inspired by analogous qualitative analysis conducted in Topic Models [20], where “topics” are interpreted in terms of the words with the highest probability under each of them.
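Operationally, this analysis amounts to sorting concepts by their hidden-layer activations, as in the following sketch, which assumes the PyTorch network sketched in Section 4.4 and a tensor `V` of concept-level visual vectors:

```python
import torch

def top_concepts_per_unit(model, V, concept_names, top_n=3):
    """For each hidden unit, list the concepts whose visual vectors produce
    the highest activation on that unit (cf. Table 3)."""
    with torch.no_grad():
        hidden = torch.tanh(model.net[0](V))   # activations after the first layer
    tops = {}
    for unit in range(hidden.shape[1]):
        order = torch.argsort(hidden[:, unit], descending=True)[:top_n]
        tops[unit] = [concept_names[i] for i in order.tolist()]
    return tops
```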
Table 3 presents both seen and unseen concepts corresponding to visual vectors that trigger the highest activation for a subset of hidden units. The table further reports, for each hidden unit, the “correct” unseen concept for the category of the top seen concepts, together with its rank in terms of activation of the unit. The analysis demonstrates that, although prior knowledge about categories was not explicitly used to train the network, the latter induced an organization of concepts into superordinate categories in which the hidden layer acts as a cross-modal concept categorization/organization system. When the induced projection function maps an object onto the linguistic space, the derived text vector will inherit a mixture of textual features from the concepts that activated the same hidden unit as the object. This suggests a bias towards seen concepts. Furthermore, in many cases of miscategorization, the concepts are still semantically coherent with the induced category, confirming that the projection function is indeed capturing a latent, cross-modal semantic space. A squirrel, although not a “large omnivore”, is still an animal, while butterflies are not flowers but often feed on their nectar.
Hidden Unit | Seen Concepts | Unseen Concept | Rank of Correct Unseen Concept | CIFAR-100 Category |
---|---|---|---|---|
Unit 1 | sunflower, tulip, pear | butterfly | 2 (rose) | flowers |
Unit 2 | cattle, camel, bear | squirrel | 2 (elephant) | large omnivores and herbivores |
Unit 3 | castle, bridge, house | bus | 4 (skyscraper) | large man-made outdoor things |
Unit 4 | man, girl, baby | boy | 1 | people |
Unit 5 | motorcycle, bicycle, tractor | streetcar | 2 (bus) | vehicles 1 |
Unit 6 | sea, plain, cloud | forest | 1 | large natural outdoor scenes |
Unit 7 | chair, couch, table | bed | 1 | household furniture |
Unit 8 | plate, bowl, can | clock | 3 (cup) | food containers |
Unit 9 | apple, pear, mushroom | orange | 1 | fruit and vegetables |
For this experiment, we focus on NN, the best performing model in the previous experiment. We use a set of approximately 9,500 concepts, the intersection of the ESP-based visual semantic space with the linguistic space. For tuning the number of hidden units of NN, we use the MEN-concrete dataset of Bruni et al. (2014). Finally, we randomly pick 70% of the concepts to induce the projection function and report results on the remaining 30%. Note that the search space for the correct label in this experiment is approximately 95 times larger than the one used for the experiment presented in Section 5.1.
Although our experimental setup differs from the one of Frome et al. (2013), thus preventing a direct comparison, the results reported in Table 5 are on a comparable scale to theirs. We note that previous work on zero-shot learning has used standard object recognition benchmarks. To the best of our knowledge, this is the first time this task has been performed on a dataset as noisy as ESP. Overall, the results suggest that cross-modal mapping could be applied in tasks where images exhibit a more complex structure, e.g., caption generation and event recognition.
Unseen Concept | Nearest Neighbors |
---|---|
tiger | cat, microchip, kitten, vet, pet |
bike | spoke, wheel, brake, tyre, motorcycle |
blossom | bud, leaf, jasmine, petal, dandelion |
bakery | quiche, bread, pie, bagel, curry |
Model \ k | 1 | 2 | 5 | 10 | 50 |
---|---|---|---|---|---|
Chance | 0.01 | 0.02 | 0.05 | 0.10 | 0.5 |
NN | 0.8 | 1.9 | 5.6 | 9.7 | 30.9 |
In this section, we aim at simulating a fast mapping scenario in which the learner has just been exposed to a new concept, and thus has limited linguistic evidence for that concept. We operationalize this by considering the 34 concrete concepts introduced by Frassinelli and Keller (2012), and deriving their text-based representations from just a few sentences randomly picked from the corpus. Concretely, we implement 5 models: context 1, context 5, context 10, context 20 and context full, where the name of the model denotes the number of sentences used to construct the text-based representations. The derived vectors were reduced with the same SVD projection induced from the complete corpus. Cross-modal mapping is done via NN.
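A minimal sketch of how such limited-context vectors could be constructed follows. The sentence sample, the context vocabulary index, and the fitted `svd` object (e.g., a TruncatedSVD retained from the textual pipeline sketched earlier) are assumed inputs; corpus-level reweighting is omitted for simplicity.

```python
import numpy as np

def fast_mapping_vector(sentences, word, context_index, svd, window=20):
    """Build a text-based vector for `word` from only a few sentences that
    contain it, then reduce it with the SVD projection induced from the
    complete corpus (`svd` is a fitted TruncatedSVD)."""
    counts = np.zeros(len(context_index))
    for sent in sentences:
        for i, w in enumerate(sent):
            if w != word:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in context_index:
                    counts[context_index[sent[j]]] += 1.0
    return svd.transform(counts.reshape(1, -1))[0]

# Hypothetical usage for the "context 5" model:
# w_nonce = fast_mapping_vector(five_sampled_sentences, "wampimuk", ctx_idx, svd)
```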
The zero-shot framework leads us to frame fast mapping as the task of projecting visual representations of new objects onto the linguistic space to retrieve their word labels ($V \rightarrow W$). This mapping from visual to textual representations is arguably a more plausible task than the reverse. If we think about how linguistic reference is acquired, a scenario in which a learner first encounters a new object and then seeks its reference in the language of the surrounding environment (e.g., adults having a conversation, the text of a book with an illustration of an unknown object) is very natural. Furthermore, since not all new concepts in the linguistic environment refer to new objects (they might denote abstract concepts or out-of-scene objects), it seems more reasonable for the learner to be more alert to linguistic cues about a recently-spotted new object than vice versa. Moreover, once the learner observes a new object, she can easily construct a full visual representation for it (and the acquisition literature has shown that humans are wired for good object segmentation and recognition [50]) – the more challenging task is to scan the ongoing and very ambiguous linguistic communication for contexts that might be relevant and informative about the new object. However, fast mapping is often described in the psychological literature as the opposite task: The learner is exposed to a new word in context and has to search for the right object it refers to. We implement this second setup ($W \rightarrow V$) by training a projection function which maps linguistic vectors to visual ones. The adaptation of NN is straightforward; the new objective function is derived as
$$J(\mathbf{\Theta}') = \sum_{i=1}^{N_s}\Big(1 - sim\big(\mathbf{v}^{(i)},\, \hat{f}'_{proj}(\mathbf{w}^{(i)})\big)\Big) \qquad (6)$$

where $\hat{f}'_{proj}: \mathbb{R}^{d_w} \rightarrow \mathbb{R}^{d_v}$ is the language-to-vision projection, $\mathbf{w}^{(i)}$ are the text-based vectors and $\mathbf{v}^{(i)}$ the target image-based vectors.
Table 7 presents the results. Not surprisingly, performance increases with the number of sentences that are used to construct the textual representations. Furthermore, all models perform better than Chance, including those that are based on just 1 or 5 sentences. This suggests that the system can make reasonable inferences about object-word connections even when linguistic evidence is very scarce.
Target | Nearest Neighbor | Target | Nearest Neighbor
---|---|---|---
cooker | potato | dishwasher | corkscrew
clarinet | drum | potato | corn |
gorilla | elephant | guitar | violin |
scooter | car | scarf | trouser |
Regarding the sources of error, a qualitative analysis of predicted word labels and objects, as presented in Table 6, suggests that both textual and visual representations, although capturing relevant “topical” or “domain” information, are not enough to single out the properties of the target concept. As an example, the textual vector of dishwasher contains kitchen-related dimensions such as fridge, oven, gas, hob, …, sink. After projecting onto the visual space, its nearest neighbours are the visual vectors of the same-domain concepts corkscrew and kettle. The latter is shown in Figure 5, with a gas hob well in evidence. As a further example, the visual vector for cooker is extracted from pictures such as the one in Figure 5. Not surprisingly, when projecting it onto the linguistic space, the nearest neighbours are other kitchen-related terms, i.e., potato and dishwasher.
Context \ Mapping | $V \rightarrow W$ | $W \rightarrow V$
---|---|---
Chance | 17 | 17 |
context 1 | 12.6 | 14.5 |
context 5 | 8.08 | 13.29 |
context 10 | 7.29 | 13.44 |
context 20 | 6.02 | 12.17 |
context full | 5.52 | 5.88 |
At the outset of this work, we considered the problem of linking purely language-based distributional semantic spaces with objects in the visual world by means of cross-modal mapping. We compared recent models for this task both on a benchmark object recognition dataset and on a more realistic and noisier dataset covering a wide range of concepts. The neural network architecture emerged as the best performing approach, and our qualitative analysis revealed that it induced a categorical organization of concepts. Most importantly, our results suggest the viability of cross-modal mapping for grounded word-meaning acquisition in a simulation of fast mapping.
Given the success of NN, we plan to experiment in the future with more sophisticated neural network architectures inspired by recent work in machine translation [19] and multimodal deep learning [51]. Furthermore, we intend to adopt visual attributes [14, 44] as visual representations, since they should allow a better understanding of how cross-modal mapping works, thanks to their linguistic interpretability. The error analysis in Section 5.3 suggests that automated localization techniques [54], distinguishing an object from its surroundings, might drastically improve mapping accuracy. Similarly, in the textual domain, models that extract collocates of a word that are more likely to denote conceptual properties [28] might lead to more informative and discriminative linguistic vectors. Finally, the lack of large child-directed speech corpora constrained the experimental design of fast mapping simulations; we plan to run more realistic experiments with true nonce words and using source corpora (e.g., the Simple Wikipedia, child stories, portions of CHILDES) that contain sentences more akin to those a child might effectively hear or read in her word-learning years.
We thank Adam Liška for helpful discussions and the 3 anonymous reviewers for useful comments. This work was supported by ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES).