Learning Whom to Trust with MACE
Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani and Eduard Hovy
Non-expert annotation services like Amazon’s Mechanical Turk (AMT) are cheap
and fast ways to evaluate systems and provide categorical annotations for
training data. Unfortunately, some annotators choose bad labels in order to
maximize their pay. Manual identification is tedious, so we experiment with an
item-response model. It learns in an unsupervised fashion to a) identify which
annotators are trustworthy and b) predict the correct underlying labels. We
match the performance of more complex state-of-the-art systems and perform well
even under adversarial conditions. We show considerable improvements over
standard baselines, both for predicted label accuracy and trustworthiness
estimates. The latter can be further improved by introducing a prior on model
parameters and using Variational Bayes inference. Additionally, we can achieve
even higher accuracy by focusing on the instances our model is most confident
in (trading off some recall), and by incorporating annotated control instances.
Our system, MACE (Multi-Annotator Competence Estimation), is available for
download.
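The abstract does not spell out the model itself, but for a concrete picture, below is a minimal EM sketch of an item-response model in the spirit of MACE: each annotator either copies the true label, with a per-annotator competence probability, or "spams" an answer drawn from a personal label distribution. This is an illustration under those assumptions, not the released MACE implementation; the names (mace_em, theta, xi, smoothing) are invented here, and the additive smoothing merely stands in for the priors and Variational Bayes inference mentioned above.

import numpy as np

def mace_em(annotations, n_labels, n_iters=50, smoothing=0.1, seed=0):
    """EM for a MACE-style item-response model (illustrative sketch).

    annotations: one dict per instance, mapping annotator id -> label index.
    Returns predicted labels, the label posterior, and per-annotator competence.
    """
    rng = np.random.default_rng(seed)
    n_items = len(annotations)
    annotators = sorted({a for item in annotations for a in item})
    idx = {a: j for j, a in enumerate(annotators)}
    n_ann = len(annotators)

    theta = rng.uniform(0.4, 0.9, size=n_ann)         # competence: P(annotator answers honestly)
    xi = np.full((n_ann, n_labels), 1.0 / n_labels)   # label distribution used when "spamming"

    for _ in range(n_iters):
        # E-step: posterior over each instance's true label under a uniform label prior
        post = np.zeros((n_items, n_labels))
        for i, item in enumerate(annotations):
            log_p = np.full(n_labels, -np.log(n_labels))
            for a, label in item.items():
                j = idx[a]
                # P(answer = label | true label = t) = theta_j * [t == label] + (1 - theta_j) * xi_j[label]
                lik = theta[j] * (np.arange(n_labels) == label) + (1 - theta[j]) * xi[j, label]
                log_p += np.log(lik + 1e-12)
            p = np.exp(log_p - log_p.max())
            post[i] = p / p.sum()

        # M-step: expected counts of honest vs. spammed answers per annotator
        honest = np.zeros(n_ann)
        spam = np.zeros(n_ann)
        spam_labels = np.zeros((n_ann, n_labels))
        for i, item in enumerate(annotations):
            for a, label in item.items():
                j = idx[a]
                for t in range(n_labels):
                    denom = theta[j] * (t == label) + (1 - theta[j]) * xi[j, label]
                    p_honest = post[i, t] * theta[j] * (t == label) / denom
                    p_spam = post[i, t] * (1 - theta[j]) * xi[j, label] / denom
                    honest[j] += p_honest
                    spam[j] += p_spam
                    spam_labels[j, label] += p_spam
        # additive smoothing stands in for the priors / Variational Bayes step described above
        theta = (honest + smoothing) / (honest + spam + 2 * smoothing)
        xi = (spam_labels + smoothing) / (spam_labels.sum(1, keepdims=True) + n_labels * smoothing)

    competence = dict(zip(annotators, theta))
    return post.argmax(1), post, competence

# toy usage: three annotators, annotator "c" always picks label 0
data = [{"a": 1, "b": 1, "c": 0},
        {"a": 0, "b": 0, "c": 0},
        {"a": 2, "b": 2, "c": 0}]
labels, posterior, competence = mace_em(data, n_labels=3)

Thresholding the entropy of the returned posterior is one way to realize the confidence-based accuracy/recall trade-off the abstract describes.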