Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics

Douwe Kiela1, Anita Lilla Verő2, Stephen Clark2
1University of Cambridge Computer Laboratory, 2University of Cambridge


Abstract

Multi-modal distributional models learn grounded representations for improved performance on semantic tasks. Deep visual representations, learned using convolutional neural networks, have been shown to achieve particularly high performance. In this study, we systematically compare deep visual representation learning techniques, experimenting with three well-known network architectures. In addition, we explore the various data sources that can be used for retrieving relevant images, showing that images from search engines perform as well as, or better than, manually curated resources such as ImageNet. Furthermore, we explore the optimal number of images and the multi-lingual applicability of multi-modal semantics. We hope that these findings can serve as a guide for future research in the field.
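
To make the pipeline described above concrete, the following is a minimal sketch of how deep visual representations might be extracted from a set of retrieved images with a pretrained convolutional network and fused with a textual word vector. It assumes the torchvision and PIL libraries; the image paths, the word vector, AlexNet as the architecture, mean-pooling over images, and weighted concatenation as the fusion step are illustrative assumptions, not necessarily the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Standard preprocessing for ImageNet-trained torchvision models.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Pretrained CNN (AlexNet chosen here as one of several possible architectures).
cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
cnn.eval()
# Drop the final classification layer so the forward pass returns the
# 4096-dimensional penultimate-layer activations as the visual feature.
cnn.classifier = nn.Sequential(*list(cnn.classifier.children())[:-1])


def visual_representation(image_paths):
    """Mean-pool CNN features over the images retrieved for one word."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(cnn(img).squeeze(0))
    return torch.stack(feats).mean(dim=0)


def multimodal_representation(text_vec, image_paths, alpha=0.5):
    """Concatenate L2-normalised textual and visual vectors (one common fusion strategy)."""
    visual = visual_representation(image_paths)
    text_vec = text_vec / text_vec.norm()
    visual = visual / visual.norm()
    return torch.cat([alpha * text_vec, (1 - alpha) * visual])
```

In this sketch, swapping `models.alexnet` for another torchvision architecture (e.g. VGG or GoogLeNet-style networks) and varying the number of retrieved images per word would correspond to the architecture and data-source comparisons the abstract describes.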