Optimal Data Set Selection: An Application to Grapheme-to-Phoneme Conversion
Young-Bum Kim and Benjamin Snyder
In this paper we introduce the task of unlabeled optimal data set
selection. Given a large pool of unlabeled examples, our goal is to select
a small subset to label that will yield a high-performance supervised
model over the entire data set. Our first proposed method, based on the
rank-revealing QR matrix factorization, selects a subset of words which
span the entire word-space effectively. For our second method, we develop
the concept of feature coverage, which we optimize with a greedy algorithm.
We apply these methods to the task of grapheme-to-phoneme prediction.
Experiments over a data set of 8 languages show that in all scenarios,
our selection methods are effective at yielding a small but
optimal set of labeled examples. When fed into a state-of-the-art
supervised model for grapheme-to-phoneme prediction, our methods yield
average error reductions of 20% over randomly selected examples.
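
The abstract does not define feature coverage in detail, so the following is only a minimal sketch of greedy coverage-based selection under stated assumptions: features are taken to be character bigrams with word-boundary markers, each feature is weighted by the number of pool words containing it, and a word's gain is the total weight of its not-yet-covered features. The names ngrams and greedy_coverage are hypothetical, and the paper's actual objective may differ.

```python
from collections import Counter


def ngrams(word, n=2):
    """Character n-grams of a word, padded with '#' boundary markers.

    Assumption: n-grams stand in for the paper's features.
    """
    padded = "#" + word + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def greedy_coverage(words, budget, n=2):
    """Select up to `budget` words to label.

    At each step, take the word whose uncovered features carry the most
    weight. This is a generic weighted max-coverage heuristic, not
    necessarily the authors' exact objective.
    """
    features = {w: ngrams(w, n) for w in words}
    # Weight each feature by the number of distinct pool words containing it.
    weight = Counter(f for fs in features.values() for f in fs)

    covered = set()
    selected = []
    remaining = list(features)  # distinct words, in first-seen order
    for _ in range(min(budget, len(remaining))):
        best = max(
            remaining,
            key=lambda w: sum(weight[f] for f in features[w] - covered),
        )
        selected.append(best)
        covered |= features[best]
        remaining.remove(best)
    return selected


pool = ["cat", "car", "dog", "cot"]
print(greedy_coverage(pool, budget=2))
```

On this toy pool the second pick is "dog" rather than "car" or "cot", because most of the features of those two words are already covered once "cat" is selected.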