IRT-based Aggregation Model of Crowdsourced Pairwise Comparison for Evaluating Machine Translations

Naoki Otani1, Toshiaki Nakazawa2, Daisuke Kawahara1, Sadao Kurohashi1
1Kyoto University, 2Japan Science and Technology Agency


Abstract

Recent work on machine translation has used crowdsourcing to reduce the cost of manual evaluation. However, crowdsourced judgments are often biased and inaccurate. In this paper, we present a statistical model that aggregates many manual pairwise comparisons to robustly measure a machine translation system's performance. Our method applies the graded response model from item response theory (IRT), which was originally developed for academic tests. We conducted experiments on a public dataset from the Workshop on Statistical Machine Translation 2013, and found that our approach resulted in highly interpretable estimates and was less affected by noisy judges than previously proposed methods.
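As background for the graded response model the abstract refers to, the following is a minimal illustrative sketch (not the paper's implementation) of how the GRM assigns probabilities to ordered response categories: the cumulative probability of responding in category k or above is a logistic function of the latent trait, and individual category probabilities are differences of adjacent cumulatives. The parameter names here are generic GRM conventions, not taken from the paper.

```python
import math

def grm_category_probs(theta, a, b):
    """Category probabilities under the graded response model (GRM).

    theta: latent trait (e.g., a translation system's quality)
    a:     discrimination parameter of the item
    b:     ascending list of K-1 category thresholds for K ordered categories
    """
    # Cumulative probability of responding in category k or higher,
    # padded with 1.0 (category 0 or higher) and 0.0 (beyond top category).
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    # Each category's probability is the difference of adjacent cumulatives.
    return [p_star[k] - p_star[k + 1] for k in range(len(b) + 1)]

# Example: four ordered categories defined by three thresholds.
probs = grm_category_probs(theta=0.5, a=1.2, b=[-1.0, 0.0, 1.0])
```

Because the thresholds are ascending, the cumulative probabilities are decreasing, so every category probability is nonnegative and the list sums to one.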