Sort Story: Sorting Jumbled Images and Captions into Stories

Harsh Agrawal1, Arjun Chandrasekaran2, Dhruv Batra1, Devi Parikh3, Mohit Bansal4
1Virginia Tech, 2Virginia Tech, Toyota Technological Institute, 3Georgia Institute of Technology, 4University of North Carolina at Chapel Hill


Abstract

Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication. We propose the task of sequencing: given a jumbled set of aligned image-caption pairs that belong to a story, the goal is to sort them so that the output sequence forms a coherent story. We present multiple approaches, via unary (position) and pairwise (order) predictions, and their ensemble-based combinations, achieving strong results on this task. We use both text-based and image-based features, which yield complementary improvements. Using qualitative examples, we demonstrate that our models have learnt interesting aspects of temporal common sense.
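
To make the task setup concrete, the following is a minimal sketch of how unary (position) and pairwise (order) scores could be combined to pick an ordering. It is an illustration under stated assumptions, not the method evaluated in this paper: the score tables `unary` and `pairwise`, the mixing weight `alpha`, and the exhaustive search over permutations are all hypothetical choices, and the search is only practical for short stories.

```python
from itertools import permutations

def sequence_score(order, unary, pairwise, alpha=0.5):
    """Score one candidate ordering (all inputs are assumed, illustrative scores).

    order[p]       -- index of the item placed at position p
    unary[i][p]    -- score for item i appearing at position p
    pairwise[i][j] -- score for item i appearing before item j
    alpha          -- hypothetical weight mixing the two terms
    """
    # Unary term: how well each item fits the position it was assigned.
    unary_term = sum(unary[item][pos] for pos, item in enumerate(order))
    # Pairwise term: agreement with every "earlier before later" pair.
    pair_term = sum(
        pairwise[order[a]][order[b]]
        for a in range(len(order))
        for b in range(a + 1, len(order))
    )
    return alpha * unary_term + (1 - alpha) * pair_term

def best_order(unary, pairwise, alpha=0.5):
    """Return the highest-scoring permutation by brute-force search."""
    n = len(unary)
    return max(
        permutations(range(n)),
        key=lambda order: sequence_score(order, unary, pairwise, alpha),
    )

# Toy example with three jumbled items whose intended order is 2, 0, 1.
unary = [
    [0.1, 0.8, 0.1],  # item 0 fits position 1
    [0.1, 0.2, 0.7],  # item 1 fits position 2
    [0.8, 0.1, 0.1],  # item 2 fits position 0
]
pairwise = [
    [0.0, 0.8, 0.1],
    [0.2, 0.0, 0.1],
    [0.9, 0.9, 0.0],
]
print(best_order(unary, pairwise))  # (2, 0, 1)
```

In this toy setup both score types agree on the ordering; the ensemble idea matters when they disagree, in which case `alpha` decides how much weight each source of evidence receives.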