Detecting Visual Text

Jesse Dodge1,  Amit Goyal2,  Xufeng Han3,  Alyssa Mensch4,  Margaret Mitchell5,  Karl Stratos6,  Kota Yamaguchi3,  Yejin Choi3,  Hal Daumé III2,  Alex Berg3,  Tamara Berg3
1UW, 2UMD, 3SBU, 4MIT, 5Aberdeen, 6Columbia


Abstract

When people describe a scene, they often include information that is not visually apparent; sometimes based on background knowledge, sometimes to tell a story. We aim to separate visual text---descriptions of what is actually being seen---from non-visual text in descriptions of natural images. To do so, we first concretely define what it means for text to be visual, annotate visual text, and then develop algorithms to automatically classify noun phrases as visual or non-visual. We find that using text alone we can achieve high accuracy on this task, and that incorporating features derived from computer vision algorithms further improves performance. Finally, we show that we can reliably mine visual nouns and adjectives from large corpora and use them effectively in the classification task.