Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods
Ben King and Steven Abney
In this paper we consider the problem of labeling the languages of words in
mixed-language documents. We approach this problem in a weakly supervised
fashion, treating it as a sequence labeling problem with monolingual text samples as
training data. Among the approaches evaluated, a conditional random field
model trained with generalized expectation criteria was the most accurate and
performed consistently as the amount of training data was varied.