Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages

Alan Akbik1, Vishwajeet Kumar2, Yunyao Li3
1IBM Research, 2Department of Computer Science Engineering, Indian Institute of Technology Bombay, 3IBM Research - Almaden


Abstract

Annotation projection based on parallel corpora has shown great promise for inexpensively creating Proposition Banks for languages that have high-quality parallel corpora and syntactic parsers. In this paper, we conduct an experimental study in which we apply this approach to three languages that lack such resources: Tamil, Bengali and Malayalam. We find an average quality difference of 6 to 20 absolute F-measure points compared with high-resource languages, which indicates that annotation projection alone is insufficient in low-resource scenarios. Based on these results, we explore the possibility of using annotation projection as a starting point for inexpensive data curation involving both experts and non-experts. We outline what such a process may look like and present an initial study to discuss its potential and challenges.