Following up on recent work on establishing a mapping between vector-based semantic embeddings of words and the visual representations of the corresponding objects from natural images, we first present a simple approach to cross-modal vector-based semantics for the task of zero-shot learning, in which an image of a previously unseen object is mapped to a linguistic representation denoting its word. We then introduce fast mapping, a challenging and more cognitively plausible variant of the zero-shot task, in which the learner is exposed to new objects and the corresponding words in very limited linguistic contexts. By combining prior linguistic and visual knowledge acquired about words and their objects, as well as exploiting the limited new evidence available, the learner must learn to associate new objects with words. Our results on this task pave the way to realistic simulations of how children or robots could use existing knowledge to bootstrap grounded semantic knowledge about new concepts.
Computational models of meaning that rely on corpus-extracted context vectors, such as LSA [31], HAL [36], Topic Models [20] and more recent neural-network approaches [11, 38] have successfully tackled a number of lexical semantics tasks, where context vector similarity highly correlates with various indices of semantic relatedness [53]. Given that these models are learned from naturally occurring data using simple associative techniques, various authors have advanced the claim that they might be also capturing some crucial aspects of how humans acquire and use language [31, 33].
However, the models induce the meaning of words entirely from their co-occurrence with other words, without links to the external world. This constitutes a serious blow to claims of cognitive plausibility in at least two respects. One is the grounding problem [24, 43]. Irrespective of their relatively high performance on various semantic tasks, it is debatable whether models that have no access to visual and perceptual information can capture the holistic, grounded knowledge that humans have about concepts. However, a possibly even more serious pitfall of vector models is lack of reference: natural language is, fundamentally, a means to communicate, and thus our words must be able to refer to objects, properties and events in the outside world [1]. Current vector models are purely language-internal, solipsistic models of meaning. Consider the very simple scenario in which visual information is being provided to an agent about the current state of the world, and the agent’s task is to determine the truth of a statement similar to There is a dog in the room. Although the agent is equipped with a powerful context vector model, this will not suffice to successfully complete the task. The model might suggest that the concepts of dog and cat are semantically related, but it has no means to determine the visual appearance of dogs, and consequently no way to verify the truth of such a simple statement.
Mapping words to the objects they denote is such a core function of language that humans are highly optimized for it, as shown by the so-called fast mapping phenomenon, whereby children can learn to associate a word with an object or property after a single exposure to it [2, 8, 7, 25]. But lack of reference is not only a theoretical weakness: Without the ability to refer to the outside world, context vectors are arguably useless for practical goals such as learning to execute natural language instructions [3, 10], applications that could greatly benefit from the rich network of lexical meaning such vectors encode in order to scale up to real-life challenges.
Very recently, a number of papers have exploited advances in automated feature extraction from images and videos to enrich context vectors with visual information [5, 16, 34, 42, 44]. This line of research tackles the grounding problem: Word representations are no longer limited to their linguistic contexts but also encode visual information present in images associated with the corresponding objects. In this paper, we rely on the same image analysis techniques but instead focus on the reference problem: We do not aim at enriching word representations with visual information, although this might be a side effect of our approach, but we address the issue of automatically mapping objects, as depicted in images, to the context vectors representing the corresponding words. This is achieved by means of a simple neural network trained to project image-extracted feature vectors to text-based vectors through a hidden layer that can be interpreted as a cross-modal semantic space.
We first test the effectiveness of our cross-modal semantic space on the so-called zero-shot learning task [40], which has recently been explored in the machine learning community [18, 49]. In this setting, we assume that our system possesses linguistic and visual information for a set of concepts in the form of text-based representations of words and image-based vectors of the corresponding objects, used for vision-to-language-mapping training. The system is then provided with visual information for a previously unseen object, and the task is to associate it with a word by cross-modal mapping. Our approach is competitive with respect to the recently proposed alternatives, while being overall simpler.
The aforementioned task is very demanding and interesting from an engineering point of view. However, from a cognitive angle, it relies on strong, unrealistic assumptions: The learner is asked to establish a link between a new object and a word for which they possess a full-fledged text-based vector extracted from a billion-word corpus. In reality, the first time a learner is exposed to a new object, the linguistic information available is likely to be equally limited. Thus, in order to consider vision-to-language mapping under more plausible conditions, similar to the ones that children or robots in a new environment are faced with, we next simulate a scenario akin to fast mapping. We show that the induced cross-modal semantic space is powerful enough that sensible guesses about the correct word denoting an object can be made, even when the linguistic context vector representing the word has been created from as little as a single sentence containing it.
The contributions of this work are three-fold. First, we conduct experiments with simple image- and text-based vector representations and compare alternative methods to perform cross-modal mapping. Then, we complement recent work [18] and show that zero-shot learning scales to a large and noisy dataset. Finally, we provide preliminary evidence that cross-modal projections can be used effectively to simulate a fast mapping scenario, thus strengthening the claims of this approach as a full-fledged, fully inductive theory of meaning acquisition.
The problem of establishing word reference has been extensively explored in computational simulations of cross-situational learning (see Fazly et al. (2010) for a recent proposal and extended review of previous work). This line of research has traditionally assumed artificial models of the external world, typically a set of linguistic or logical labels for objects, actions and possibly other aspects of a scene [46]. Recently, Yu and Siskind (2013) presented a system that induces word-object mappings from features extracted from short videos paired with sentences. Our work complements theirs in two ways. First, unlike Yu and Siskind (2013) who considered a limited lexicon of 15 items with only 4 nouns, we conduct experiments in a large search space containing a highly ambiguous set of potential target words for every object (see Section 4.1). Most importantly, by projecting visual representations of objects into a shared semantic space, we do not limit ourselves to establishing a link between objects and words. We induce a rich semantic representation of the multimodal concept, that can lead, among other things, to the discovery of important properties of an object even when we lack its linguistic label. Nevertheless, Yu and Siskind’s system could in principle be used to initialize the vision-language mapping that we rely upon.
Closer to the spirit of our work are two very recent studies coming from the machine learning community. Socher et al. (2013) and Frome et al. (2013) focus on zero-shot learning in the vision-language domain by exploiting a shared visual-linguistic semantic space. Socher et al. (2013) learn to project unsupervised vector-based image representations onto a word-based semantic space using a neural network architecture. Unlike us, Socher and colleagues train an outlier detector to decide whether a test image should receive a known-word label by means of a standard supervised object classifier, or be assigned an unseen label by vision-to-language mapping. In our zero-shot experiments, we assume no access to an outlier detector, and thus, the search for the correct label is performed in the full concept space. Furthermore, Socher and colleagues present a much more constrained evaluation setup, where only 10 concepts are considered, compared to our experiments with hundreds or thousands of concepts.
Frome et al. (2013) use linear regression to transform vector-based image representations onto vectors representing the same concepts in linguistic semantic space. Unlike Socher et al. (2013) and the current study, which adopt simple unsupervised techniques for constructing image representations, Frome et al. (2013) rely on a supervised state-of-the-art method: They feed low-level features to a deep neural network trained on a supervised object recognition task [29]. Furthermore, their text-based vectors, induced with the neural language model of [39], encode very rich information. A natural question we aim to answer is whether the success of cross-modal mapping is due to the high-quality embeddings or to the general algorithmic design. If the latter is the case, then these results could be extended to traditional distributional vectors bearing other desirable properties, such as high interpretability of dimensions.
“We found a cute, hairy wampimuk sleeping behind the tree.” Even though the previous statement is certainly the first time one hears about wampimuks, the linguistic context already creates some visual expectations: Wampimuks probably resemble small animals (Figure 1). This is the scenario of zero-shot learning. Moreover, if this is also the first linguistic encounter of that concept, then we refer to the task as fast mapping.
Concretely, we assume that concepts, denoted for convenience by word labels, are represented in linguistic terms by vectors in a text-based distributional semantic space (see Section 4.3). Objects corresponding to concepts are represented in visual terms by vectors in an image-based semantic space (Section 4.2). For a subset of concepts (e.g., a set of animals, a set of vehicles), we possess information related to both their linguistic and visual representations. During training, this cross-modal vocabulary is used to induce a projection function (Section 4.4), which – intuitively – represents a mapping between visual and linguistic dimensions. Thus, this function, given a visual vector, returns its corresponding linguistic representation. At test time, the system is presented with a previously unseen object (e.g., wampimuk). This object is projected onto the linguistic space and associated with the word label of the nearest neighbor in that space (degus in Figure 1).
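To make this test-time step concrete, the following minimal Python/NumPy sketch labels a new image vector by projecting it into the linguistic space and retrieving the word of its nearest neighbor there. The projection function, the word matrix `W`, the label list and `v_new` are hypothetical placeholders, not part of a released pipeline.

```python
import numpy as np

def cosine_sim(a, B):
    """Cosine similarity between a vector a and every row of matrix B."""
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

def zero_shot_label(v_new, project, W, labels):
    """Project an image-based vector into the linguistic space and
    return the word label of its nearest (cosine) neighbor there."""
    w_hat = project(v_new)        # mapped representation in text space
    sims = cosine_sim(w_hat, W)   # compare with all word vectors
    return labels[int(np.argmax(sims))]

# Hypothetical usage with a linear projection matrix A learned beforehand
# (see Section 4.4):
# label = zero_shot_label(v_new, lambda v: v @ A, W, labels)
```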
The fast mapping setting can be seen as a special case of the zero-shot task. Whereas for the latter our system assumes that all concepts have rich linguistic representations (i.e., representations estimated from a large corpus), in the case of the former, new concepts are assumed to be encountered in a limited linguistic context and therefore to lack rich linguistic representations. This is operationalized by constructing the text-based vector for these concepts from a context of just a few occurrences. In this way, we simulate the first encounter of a learner with a concept that is new in both visual and linguistic terms.
The CIFAR-100 dataset [30] consists of 60,000 32x32 colour images (note the extremely small size) representing 100 distinct concepts, with 600 images per concept. The dataset covers a wide range of concrete domains and is organized into 20 broader categories. Table 1 lists the concepts used in our experiments organized by category.
Category | Seen Concepts | Unseen (Test) Concepts |
---|---|---|
aquatic mammals | beaver, otter, seal, whale | dolphin |
fish | ray, trout | shark |
flowers | orchid, poppy, sunflower, tulip | rose |
food containers | bottle, bowl, can, plate | cup |
fruit and vegetables | apple, mushroom, pear | orange |
household electrical devices | keyboard, lamp, telephone, television | clock |
household furniture | chair, couch, table, wardrobe | bed |
insects | bee, beetle, caterpillar, cockroach | butterfly |
large carnivores | bear, leopard, lion, wolf | tiger |
large man-made outdoor things | bridge, castle, house, road | skyscraper |
large natural outdoor scenes | cloud, mountain, plain, sea | forest |
large omnivores and herbivores | camel, cattle, chimpanzee, kangaroo | elephant |
medium-sized mammals | fox, porcupine, possum, skunk | raccoon |
non-insect invertebrates | crab, snail, spider, worm | lobster |
people | baby, girl, man, woman | boy |
reptiles | crocodile, dinosaur, snake, turtle | lizard |
small mammals | hamster, mouse, rabbit, shrew | squirrel |
vehicles 1 | bicycle, motorcycle, train | bus |
vehicles 2 | rocket, tank, tractor | streetcar |
Our second dataset consists of 100K images from the ESP-Game dataset, labeled through a “game with a purpose” [55] (http://www.cs.cmu.edu/~biglou/resources/). The ESP image tags form a vocabulary of 20,515 unique words. Unlike other datasets used for zero-shot learning, it covers adjectives and verbs in addition to nouns. On average, an image has 14 tags and a word appears as a tag for 70 images. Unlike the CIFAR-100 images, which were chosen specifically for image object recognition tasks (i.e., each image clearly depicts a single object in the foreground), ESP contains a random selection of images from the Web. Consequently, objects do not appear in most images in their prototypical display, but rather as elements of complex scenes (see Figure 2). Thus, ESP constitutes a more realistic, and at the same time more challenging, simulation of how things are encountered in real life, testing the potential of cross-modal mapping to deal with the complex scenes that one would encounter in event recognition and caption generation tasks.
Image-based vectors are extracted using the unsupervised bag-of-visual-words (BoVW) representational architecture [47, 12], that has been widely and successfully applied to computer vision tasks such as object recognition and image retrieval [56]. First, low-level visual features [52] are extracted from a large collection of images and clustered into a set of “visual words”. The low-level features of a specific image are then mapped to the corresponding visual words, and the image is represented by a count vector recording the number of occurrences of each visual word in it. We do not attempt any parameter tuning of the pipeline.
As low-level features, we use Scale Invariant Feature Transform (SIFT) features [35]. SIFT features are tailored to capture object parts and to be invariant to several image transformations such as rotation, illumination and scale change. These features are clustered into vocabularies of 5,000 (ESP) and 4,096 (CIFAR-100) visual words (for the vocabulary size, we relied on standard settings found in the relevant literature [5, 9]). To preserve spatial information in the BoVW representation, we use the spatial pyramid technique [32], which consists in dividing the image into several regions, computing BoVW vectors for each region and concatenating them. In particular, we divide ESP images into 16 regions and the smaller CIFAR-100 images into 4. The vectors resulting from region concatenation have dimensionality 80,000 (ESP) and 16,384 (CIFAR-100), respectively. We apply Local Mutual Information (LMI, [13]) as a weighting scheme and reduce the full co-occurrence space to 300 dimensions using Singular Value Decomposition.
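For readers who want to reproduce the general flavour of this visual pipeline, here is a minimal sketch in Python with scikit-learn. It assumes SIFT descriptors have already been extracted per image region with an external tool; the vocabulary size, region count and the simplified LMI weighting shown here are illustrative, not an exact re-implementation of our setup.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD

N_VISUAL_WORDS = 5000   # ESP setting; 4096 for CIFAR-100
N_REGIONS = 16          # spatial pyramid: 16 regions for ESP, 4 for CIFAR-100

def build_vocabulary(all_descriptors, k=N_VISUAL_WORDS):
    """Cluster SIFT descriptors (n x 128 array) pooled from a large image
    collection into k 'visual words'."""
    km = MiniBatchKMeans(n_clusters=k, random_state=0)
    km.fit(all_descriptors)
    return km

def bovw_vector(region_descriptors, vocab):
    """Represent one image: assign each region's descriptors to visual words,
    count them, and concatenate the per-region histograms (spatial pyramid)."""
    hists = []
    for descs in region_descriptors:    # one descriptor array per region
        words = vocab.predict(descs)
        hists.append(np.bincount(words, minlength=vocab.n_clusters))
    return np.concatenate(hists)        # dimensionality = k * n_regions

def lmi(M):
    """Simplified Local Mutual Information weighting: count * PMI."""
    total = M.sum()
    rows = M.sum(axis=1, keepdims=True)
    cols = M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((M * total) / (rows * cols))
    pmi[~np.isfinite(pmi)] = 0.0
    return M * pmi

def weight_and_reduce(M, dim=300):
    """LMI weighting followed by SVD reduction to 300 dimensions."""
    return TruncatedSVD(n_components=dim, random_state=0).fit_transform(lmi(M))
```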
For CIFAR-100, we extract distinct visual vectors for single images. For ESP, given the size and amount of noise in this dataset, we build vectors for visual concepts, by normalizing and summing the BoVW vectors of all the images that have the relevant concept as a tag. Note that the relevant literature [41] has emphasized the importance of learners self-generating multiple views when faced with new objects. Thus, our multiple-image assumption should not be considered problematic in the current setup.
We implement the entire visual pipeline with VSEM, an open library for visual semantics [4] (http://clic.cimec.unitn.it/vsem/).
For constructing the text-based vectors, we follow a standard pipeline in distributional semantics [53] without tuning its parameters and collect co-occurrence statistics from the concatenation of ukWaC (http://wacky.sslmit.unibo.it) and Wikipedia, amounting to 2.7 billion tokens in total. Semantic vectors are constructed for a set of 30K target words (lemmas), namely the top 20K most frequent nouns, 5K most frequent adjectives and 5K most frequent verbs, and the same 30K lemmas are also employed as contextual elements. We collect co-occurrences in a symmetric context window of 20 elements around a target word. Finally, similarly to the visual semantic space, raw counts are transformed by applying LMI and then reduced to 300 dimensions with SVD. (We also experimented with the image- and text-based vectors of Socher et al. (2013), but achieved better performance with the reported setup.)
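A minimal sketch of the corresponding textual pipeline is given below. The corpus iterator and the 30K-lemma vocabulary are placeholders; in practice a sparse matrix would be used, and the same LMI/SVD helpers sketched for the visual space apply here.

```python
import numpy as np
from collections import defaultdict

WINDOW = 20  # symmetric context window around each target lemma

def cooccurrence_matrix(sentences, targets):
    """Count target-context co-occurrences within a +/- WINDOW window.
    `sentences` is an iterable of lemma lists; `targets` is the 30K-lemma
    vocabulary used both as targets and as contextual elements."""
    idx = {w: i for i, w in enumerate(targets)}
    counts = defaultdict(float)
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in idx:
                continue
            lo, hi = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in idx:
                    counts[(idx[w], idx[sent[j]])] += 1.0
    M = np.zeros((len(targets), len(targets)))
    for (r, c), v in counts.items():
        M[r, c] = v
    return M

# Raw counts are then reweighted with LMI and reduced to 300 dimensions with
# SVD, exactly as for the visual space; with the helpers sketched above:
# W = weight_and_reduce(cooccurrence_matrix(corpus_sentences, target_lemmas))
```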
The process of learning to map objects to their word labels is implemented by training a projection function $f_{proj}$ from the visual onto the linguistic semantic space. For the learning, we use a set of $N_s$ seen concepts for which we have both image-based visual representations $\mathbf{V}_s \in \mathbb{R}^{N_s \times d_v}$ and text-based linguistic representations $\mathbf{W}_s \in \mathbb{R}^{N_s \times d_w}$. The projection function is subject to an objective that aims at minimizing some cost function between the induced text-based representations $\hat{\mathbf{W}}_s$ and the gold ones $\mathbf{W}_s$. The induced $\hat{f}_{proj}$ is then applied to the image-based representations $\mathbf{V}_u$ of unseen objects to transform them into text-based representations $\hat{\mathbf{W}}_u$. We implement 4 alternative learning algorithms for inducing the cross-modal projection function $\hat{f}_{proj}$.
Our first model is a very simple linear mapping between the two modalities estimated by solving a least-squares problem. This method is similar to the one introduced by Mikolov et al. (2013a) for estimating a translation matrix, only solved analytically. In our setup, we can see the two different modalities as if they were different languages. By using least-squares regression, the projection function can be derived as
$$\hat{f}_{proj} = (\mathbf{V}_s^{T}\mathbf{V}_s)^{-1}\,\mathbf{V}_s^{T}\mathbf{W}_s \qquad (1)$$
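A minimal NumPy sketch of the lin model follows, assuming row-aligned seen matrices `Vs` and `Ws`; `np.linalg.lstsq` computes the analytical least-squares solution of Equation (1).

```python
import numpy as np

def fit_linear_projection(Vs, Ws):
    """Closed-form least-squares solution of Equation (1):
    find A minimizing ||Vs A - Ws||^2."""
    A, *_ = np.linalg.lstsq(Vs, Ws, rcond=None)
    return A                      # shape: d_v x d_w

def project_linear(V_unseen, A):
    """Map image-based vectors into the linguistic space."""
    return V_unseen @ A
```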
CCA (Canonical Correlation Analysis) [22, 27] and variations thereof have been successfully used in the past for annotation of regions [48] and complete images [21, 26]. Given two paired observation matrices, in our case $\mathbf{V}_s$ and $\mathbf{W}_s$, CCA aims at capturing the linear relationship that exists between these variables. This is achieved by finding a pair of matrices, in our case $\mathbf{C}_V$ and $\mathbf{C}_W$, such that the correlation between the projections of the two multidimensional variables into a common, lower-rank space is maximized. The resulting multimodal space has been shown to provide a good approximation to human concept similarity judgments [45]. In our setup, after applying CCA on the two spaces $\mathbf{V}_s$ and $\mathbf{W}_s$, we obtain the two projection mappings onto the common space and thus our projection function can be derived as:

$$\hat{f}_{proj} = \mathbf{C}_V\,\mathbf{C}_W^{+} \qquad (2)$$

where $\mathbf{C}_W^{+}$ denotes the pseudoinverse of $\mathbf{C}_W$.
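Below is a minimal sketch using scikit-learn's CCA, assuming the same row-aligned matrices. For simplicity it compares image and word vectors directly in the learned common space rather than explicitly composing the two projection matrices as in Equation (2); the number of components is illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def fit_cca(Vs, Ws, n_components=128):
    """Fit CCA on the paired seen matrices (n_components is illustrative)."""
    cca = CCA(n_components=n_components, max_iter=1000)
    cca.fit(Vs, Ws)
    return cca

def label_via_cca(cca, v_new, W_candidates, labels):
    """Project the new image vector and all candidate word vectors into the
    common CCA space and return the label of the most similar word."""
    v_c, W_c = cca.transform(v_new.reshape(1, -1), W_candidates)
    v_c = v_c / np.linalg.norm(v_c)
    W_c = W_c / np.linalg.norm(W_c, axis=1, keepdims=True)
    return labels[int(np.argmax(W_c @ v_c.ravel()))]
```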
SVD is the most widely used dimensionality reduction technique in distributional semantics [53], and it has recently been exploited to combine visual and linguistic dimensions in the multimodal distributional semantic model of Bruni et al. (2014). SVD smoothing is also a way to infer values of unseen dimensions in partially incomplete matrices, a technique that has been applied to the task of inferring word tags of unannotated images [23]. Assuming that the concept-representing rows of $\mathbf{V}_s$ and $\mathbf{W}_s$ are ordered in the same way, we apply the ($k$-truncated) SVD to the concatenated matrix $[\mathbf{V}_s\ \mathbf{W}_s]$, such that $[\hat{\mathbf{V}}_s\ \hat{\mathbf{W}}_s] = \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{Z}_k^{T}$ is a $k$-rank approximation of the original matrix (we denote the right singular vectors matrix by $\mathbf{Z}$ instead of the customary $\mathbf{V}$ to avoid confusion with the visual matrix). The projection function is then:

$$\hat{f}_{proj}(\mathbf{V}_u) = [\mathbf{V}_u\ \mathbf{0}]\,\mathbf{Z}_k\mathbf{Z}_k^{T} \qquad (3)$$

where the input is appropriately padded with 0s ($[\mathbf{V}_u\ \mathbf{0}]$) and we discard the visual block $\hat{\mathbf{V}}_u$ of the output matrix $[\hat{\mathbf{V}}_u\ \hat{\mathbf{W}}_u]$.
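A minimal NumPy sketch of this SVD-based projection, assuming row-aligned `Vs` and `Ws` and an illustrative `k`:

```python
import numpy as np

def fit_svd_projection(Vs, Ws, k=300):
    """Truncated SVD of the concatenated matrix [Vs Ws]; returns the
    top-k right singular vectors Z_k (Equation 3)."""
    _, _, Zt = np.linalg.svd(np.hstack([Vs, Ws]), full_matrices=False)
    return Zt[:k].T                               # shape: (d_v + d_w) x k

def project_svd(V_unseen, Zk, d_w):
    """Pad unseen visual vectors with zeros for the text block, project onto
    the k-dimensional space and back, and keep only the text block."""
    padded = np.hstack([V_unseen, np.zeros((V_unseen.shape[0], d_w))])
    reconstructed = padded @ Zk @ Zk.T
    return reconstructed[:, -d_w:]                # induced text-based vectors
```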
The last model that we introduce is a neural network with one hidden layer. The projection function in this model can be described as:
$$\hat{f}_{proj}(\mathbf{V}_s) = \sigma_2\!\left(\sigma_1\!\left(\mathbf{V}_s\mathbf{\Theta}^{(1)}\right)\mathbf{\Theta}^{(2)}\right) \qquad (4)$$

where $\mathbf{\Theta}$ consists of the model weights $\mathbf{\Theta}^{(1)}$ and $\mathbf{\Theta}^{(2)}$ that map the input image-based vectors first to the hidden layer and then to the output layer in order to obtain text-based vectors, i.e., $\hat{\mathbf{W}}_s = \hat{f}_{proj}(\mathbf{V}_s)$, and $\sigma_1$ and $\sigma_2$ are the non-linear activation functions. We experimented with sigmoid, hyperbolic tangent and linear activations; hyperbolic tangent yielded the highest performance. The weights are estimated by minimizing the objective function
$$J(\mathbf{\Theta}) = \sum_{i=1}^{N_s}\Big(1 - sim\big(\mathbf{w}^{(i)},\, \hat{f}_{proj}(\mathbf{v}^{(i)})\big)\Big) \qquad (5)$$

where $sim$ is some similarity function. In our experiments we used the cosine as similarity function, so that $sim(\mathbf{w}, \hat{\mathbf{w}}) = \frac{\mathbf{w}\cdot\hat{\mathbf{w}}}{\|\mathbf{w}\|\,\|\hat{\mathbf{w}}\|}$, thus penalizing parameter settings leading to a low cosine between the target linguistic representations $\mathbf{w}^{(i)}$ and those produced by the projection function, $\hat{f}_{proj}(\mathbf{v}^{(i)})$. The cosine has been widely used in the distributional semantic literature, and it has been shown to outperform Euclidean distance [6]. (We also experimented with the same objective function as Socher et al. (2013); however, our objective function yielded consistently better results in all experimental settings.) Parameters were estimated with standard backpropagation and L-BFGS.
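The following PyTorch sketch illustrates the architecture of Equation (4) and the cosine-based objective of Equation (5). The hidden size, the number of optimization steps, and the use of PyTorch's LBFGS optimizer are illustrative stand-ins for the backpropagation + L-BFGS training described above, not our exact implementation.

```python
import torch
import torch.nn as nn

class CrossModalNN(nn.Module):
    """One hidden layer mapping image-based vectors to text-based vectors,
    with tanh activations (Equation 4)."""
    def __init__(self, d_v, d_w, d_hidden=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_v, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_w), nn.Tanh(),
        )

    def forward(self, v):
        return self.net(v)

def train(model, Vs, Ws, steps=50):
    """Minimize 1 - cosine(target, prediction), summed over seen concepts
    (Equation 5), using the LBFGS optimizer."""
    opt = torch.optim.LBFGS(model.parameters(), max_iter=steps)
    def closure():
        opt.zero_grad()
        loss = (1 - nn.functional.cosine_similarity(model(Vs), Ws, dim=1)).sum()
        loss.backward()
        return loss
    opt.step(closure)
    return model
```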
Our experiments focus on the tasks of zero-shot learning (Sections 5.1 and 5.2) and fast mapping (Section 5.3). In both tasks, the projected vector of the unseen concept is labeled with the word associated with its cosine-based nearest neighbor vector in the corresponding semantic space.
For the zero-shot task we report the accuracy of retrieving the correct label among the top $k$ neighbors from a semantic space populated with the union of seen and unseen concepts. For fast mapping, we report the mean rank of the correct concept among the fast mapping candidates.
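Concretely, both measures can be computed from the rank of the gold word among the cosine neighbors of each projected vector, as in this minimal sketch (variable names are illustrative):

```python
import numpy as np

def rank_of_gold(w_hat, W, labels, gold):
    """Rank of the gold word among the cosine neighbors of a projected vector."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = Wn @ (w_hat / np.linalg.norm(w_hat))
    order = np.argsort(-sims)
    return int(np.where(np.asarray(labels)[order] == gold)[0][0]) + 1

def hit_at_k(ranks, k):
    """Zero-shot accuracy: fraction of test items whose correct label is
    retrieved among the top k neighbors."""
    return float(np.mean(np.asarray(ranks) <= k))

def mean_rank(ranks):
    """Mean rank of the correct concept (the fast mapping measure)."""
    return float(np.mean(ranks))
```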
For this experiment, we use the intersection of our linguistic space with the concepts present in CIFAR-100, containing a total of 90 concepts. For each concept category, we treat all concepts but one as seen concepts (Table 1). The 71 seen concepts correspond to 42,600 distinct visual vectors and are used to induce the projection function. Table 2 reports results obtained by averaging the performance on the 11,400 distinct vectors of the 19 unseen concepts.
Our 4 models introduced in Section 4.4 are compared to Chance, a theoretically derived baseline that simulates selecting a label at random. For the neural network NN, we use prior knowledge about the number of concept categories to set the number of hidden units to 20, in order to avoid tuning this parameter. For the SVD model, we set the number of dimensions to 300, a common choice in distributional semantics, coherent with the settings we used for the visual and linguistic spaces.
First and foremost, all 4 models outperform Chance by a large margin. Surprisingly, the very simple lin method outperforms both CCA and SVD. However, NN, an architecture that can capture more complex, non-linear relations in features across modalities, emerges as the best performing model, confirming on a larger scale the recent findings of Socher et al. (2013).
Model \ k | 1 | 2 | 3 | 5 | 10 | 20 |
---|---|---|---|---|---|---|
Chance | 1.1 | 2.2 | 3.3 | 5.5 | 11.0 | 22.0 |
SVD | 1.9 | 5.0 | 8.1 | 14.5 | 29.0 | 48.6 |
CCA | 3.0 | 6.9 | 10.7 | 17.9 | 31.7 | 51.7 |
lin | 2.4 | 6.4 | 10.5 | 18.7 | 33.0 | 55.0 |
NN | 3.9 | 6.6 | 10.6 | 21.9 | 37.9 | 58.2 |
In order to gain qualitative insights into the performance of the projection process of NN, we attempt to investigate the role and interpretability of the hidden layer. We achieve this by looking at which visual concepts result in the highest hidden unit activation. (For this post-hoc analysis, we include a sparsity parameter in the objective function of Equation 5 in order to get more interpretable results; hidden units are therefore maximally activated by only a few concepts.) This is inspired by analogous qualitative analysis conducted in Topic Models [20], where “topics” are interpreted in terms of the words with the highest probability under each of them.
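Operationally, this analysis amounts to sorting concepts by their hidden-layer activations, as in the following sketch, which assumes the PyTorch network sketched in Section 4.4 and a tensor `V` of concept-level visual vectors:

```python
import torch

def top_concepts_per_unit(model, V, concept_names, top_n=3):
    """For each hidden unit, list the concepts whose visual vectors produce
    the highest activation on that unit (cf. Table 3)."""
    with torch.no_grad():
        hidden = torch.tanh(model.net[0](V))   # activations after the first layer
    tops = {}
    for unit in range(hidden.shape[1]):
        order = torch.argsort(hidden[:, unit], descending=True)[:top_n]
        tops[unit] = [concept_names[i] for i in order.tolist()]
    return tops
```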
Table 3 presents both seen and unseen concepts corresponding to visual vectors that trigger the highest activation for a subset of hidden units. The table further reports, for each hidden unit, the “correct” unseen concept for the category of the top seen concepts, together with its rank in terms of activation of the unit. The analysis demonstrates that, although prior knowledge about categories was not explicitly used to train the network, the latter induced an organization of concepts into superordinate categories in which the hidden layer acts as a cross-modal concept categorization/organization system. When the induced projection function maps an object onto the linguistic space, the derived text vector will inherit a mixture of textual features from the concepts that activated the same hidden unit as the object. This suggests a bias towards seen concepts. Furthermore, in many cases of miscategorization, the concepts are still semantically coherent with the induced category, confirming that the projection function is indeed capturing a latent, cross-modal semantic space. A squirrel, although not a “large omnivore”, is still an animal, while butterflies are not flowers but often feed on their nectar.
Hidden Unit | Seen Concepts | Unseen Concept | Rank of Correct Unseen Concept | CIFAR-100 Category |
---|---|---|---|---|
Unit 1 | sunflower, tulip, pear | butterfly | 2 (rose) | flowers |
Unit 2 | cattle, camel, bear | squirrel | 2 (elephant) | large omnivores and herbivores |
Unit 3 | castle, bridge, house | bus | 4 (skyscraper) | large man-made outdoor things |
Unit 4 | man, girl, baby | boy | 1 | people |
Unit 5 | motorcycle, bicycle, tractor | streetcar | 2 (bus) | vehicles 1 |
Unit 6 | sea, plain, cloud | forest | 1 | large natural outdoor scenes |
Unit 7 | chair, couch, table | bed | 1 | household furniture |
Unit 8 | plate, bowl, can | clock | 3 (cup) | food containers |
Unit 9 | apple, pear, mushroom | orange | 1 | fruit and vegetables |
For this experiment, we focus on NN, the best performing model in the previous experiment. We use a set of approximately 9,500 concepts, the intersection of the ESP-based visual semantic space with the linguistic space. For tuning the number of hidden units of NN, we use the MEN-concrete dataset of Bruni et al. (2014). Finally, we randomly pick 70% of the concepts to induce the projection function and report results on the remaining 30%. Note that the search space for the correct label in this experiment is approximately 95 times larger than the one used for the experiment presented in Section 5.1.
Although our experimental setup differs from the one of Frome et al. (2013), thus preventing a direct comparison, the results reported in Table 5 are on a comparable scale to theirs. We note that previous work on zero-shot learning has used standard object recognition benchmarks. To the best of our knowledge, this is the first time this task has been performed on a dataset as noisy as ESP. Overall, the results suggest that cross-modal mapping could be applied in tasks where images exhibit a more complex structure, e.g., caption generation and event recognition.
Unseen Concept | Nearest Neighbors |
---|---|
tiger | cat, microchip, kitten, vet, pet |
bike | spoke, wheel, brake, tyre, motorcycle |
blossom | bud, leaf, jasmine, petal, dandelion |
bakery | quiche, bread, pie, bagel, curry |
Model \ k | 1 | 2 | 5 | 10 | 50 |
---|---|---|---|---|---|
Chance | 0.01 | 0.02 | 0.05 | 0.10 | 0.5 |
NN | 0.8 | 1.9 | 5.6 | 9.7 | 30.9 |
In this section, we aim at simulating a fast mapping scenario in which the learner has just been exposed to a new concept, and thus has limited linguistic evidence for that concept. We operationalize this by considering the 34 concrete concepts introduced by Frassinelli and Keller (2012), and deriving their text-based representations from just a few sentences randomly picked from the corpus. Concretely, we implement 5 models: context 1, context 5, context 10, context 20 and context full, where the name of the model denotes the number of sentences used to construct the text-based representations. The derived vectors were reduced with the same SVD projection induced from the complete corpus. Cross-modal mapping is done via NN.
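A minimal sketch of how such limited-context vectors could be constructed follows. The sentence sample, the context vocabulary index, and the fitted `svd` object (e.g., a TruncatedSVD retained from the textual pipeline sketched earlier) are assumed inputs; corpus-level reweighting is omitted for simplicity.

```python
import numpy as np

def fast_mapping_vector(sentences, word, context_index, svd, window=20):
    """Build a text-based vector for `word` from only a few sentences that
    contain it, then reduce it with the SVD projection induced from the
    complete corpus (`svd` is a fitted TruncatedSVD)."""
    counts = np.zeros(len(context_index))
    for sent in sentences:
        for i, w in enumerate(sent):
            if w != word:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in context_index:
                    counts[context_index[sent[j]]] += 1.0
    return svd.transform(counts.reshape(1, -1))[0]

# Hypothetical usage for the "context 5" model:
# w_nonce = fast_mapping_vector(five_sampled_sentences, "wampimuk", ctx_idx, svd)
```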
The zero-shot framework leads us to frame fast mapping as the task of projecting visual representations of new objects onto the linguistic space to retrieve their word labels ($V \rightarrow W$). This mapping from visual to textual representations is arguably a more plausible task than the reverse. If we think about how linguistic reference is acquired, a scenario in which a learner first encounters a new object and then seeks its reference in the language of the surrounding environment (e.g., adults having a conversation, the text of a book with an illustration of an unknown object) is very natural. Furthermore, since not all new concepts in the linguistic environment refer to new objects (they might denote abstract concepts or out-of-scene objects), it seems more reasonable for the learner to be more alert to linguistic cues about a recently-spotted new object than vice versa. Moreover, once the learner observes a new object, she can easily construct a full visual representation for it (and the acquisition literature has shown that humans are wired for good object segmentation and recognition [50]) – the more challenging task is to scan the ongoing and very ambiguous linguistic communication for contexts that might be relevant and informative about the new object. However, fast mapping is often described in the psychological literature as the opposite task: The learner is exposed to a new word in context and has to search for the right object it refers to. We implement this second setup ($W \rightarrow V$) by training a projection function which maps linguistic vectors to visual ones. The adaptation of NN is straightforward; the new objective function is derived as
$$J(\mathbf{\Theta}') = \sum_{i=1}^{N_s}\Big(1 - sim\big(\mathbf{v}^{(i)},\, \hat{f}'_{proj}(\mathbf{w}^{(i)})\big)\Big) \qquad (6)$$

where $\hat{f}'_{proj}: \mathbb{R}^{d_w} \rightarrow \mathbb{R}^{d_v}$ is the language-to-vision projection, $\mathbf{w}^{(i)}$ are the text-based vectors and $\mathbf{v}^{(i)}$ the target image-based vectors.
Table 7 presents the results. Not surprisingly, performance increases with the number of sentences that are used to construct the textual representations. Furthermore, all models perform better than Chance, including those that are based on just 1 or 5 sentences. This suggests that the system can make reasonable inferences about object-word connections even when linguistic evidence is very scarce.
Target | Nearest Neighbor | Target | Nearest Neighbor
---|---|---|---
cooker | potato | dishwasher | corkscrew
clarinet | drum | potato | corn |
gorilla | elephant | guitar | violin |
scooter | car | scarf | trouser |
Regarding the sources of error, a qualitative analysis of predicted word labels and objects, as presented in Table 6, suggests that both textual and visual representations, although capturing relevant “topical” or “domain” information, are not enough to single out the properties of the target concept. As an example, the textual vector of dishwasher contains kitchen-related dimensions such as fridge, oven, gas, hob, …, sink. After projecting onto the visual space, its nearest neighbours are the visual vectors of the same-domain concepts corkscrew and kettle. The latter is shown in Figure 5, with a gas hob well in evidence. As a further example, the visual vector for cooker is extracted from pictures such as the one in Figure 5. Not surprisingly, when projecting it onto the linguistic space, the nearest neighbours are other kitchen-related terms, i.e., potato and dishwasher.
Context \ Mapping | $V \rightarrow W$ | $W \rightarrow V$
---|---|---
Chance | 17 | 17 |
context 1 | 12.6 | 14.5 |
context 5 | 8.08 | 13.29 |
context 10 | 7.29 | 13.44 |
context 20 | 6.02 | 12.17 |
context full | 5.52 | 5.88 |
At the outset of this work, we considered the problem of linking purely language-based distributional semantic spaces with objects in the visual world by means of cross-modal mapping. We compared recent models for this task both on a benchmark object recognition dataset and on a more realistic and noisier dataset covering a wide range of concepts. The neural network architecture emerged as the best performing approach, and our qualitative analysis revealed that it induced a categorical organization of concepts. Most importantly, our results suggest the viability of cross-modal mapping for grounded word-meaning acquisition in a simulation of fast mapping.
Given the success of NN, we plan to experiment in the future with more sophisticated neural network architectures inspired by recent work in machine translation [19] and multimodal deep learning [51]. Furthermore, we intend to adopt visual attributes [14, 44] as visual representations, since they should allow a better understanding of how cross-modal mapping works, thanks to their linguistic interpretability. The error analysis in Section 5.3 suggests that automated localization techniques [54], distinguishing an object from its surroundings, might drastically improve mapping accuracy. Similarly, in the textual domain, models that extract collocates of a word that are more likely to denote conceptual properties [28] might lead to more informative and discriminative linguistic vectors. Finally, the lack of large child-directed speech corpora constrained the experimental design of fast mapping simulations; we plan to run more realistic experiments with true nonce words and using source corpora (e.g., the Simple Wikipedia, child stories, portions of CHILDES) that contain sentences more akin to those a child might effectively hear or read in her word-learning years.
We thank Adam Liška for helpful discussions and the 3 anonymous reviewers for useful comments. This work was supported by ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES).