Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods

Ben King and Steven Abney

In this paper we consider the problem of labeling the language of each word in mixed-language documents. We approach the problem in a weakly supervised fashion, treating it as a sequence labeling task in which monolingual text samples serve as the training data. Among the approaches evaluated, a conditional random field model trained with generalized expectation criteria was the most accurate and performed consistently as the amount of training data was varied.
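
To make the weakly supervised setup concrete, the following is a minimal sketch of word-level language labeling trained only on monolingual samples. It is not the conditional random field model described above: it is a simple character-bigram naive Bayes labeler that scores each word independently, with no sequence information. The function names (`char_ngrams`, `train`, `label_words`) and the toy English and Spanish samples are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=2):
    """Build one n-gram count table per language.

    `samples` maps a language code to a monolingual text sample;
    no word-level labels are required (hypothetical toy setup).
    """
    models = {}
    for lang, text in samples.items():
        counts = Counter()
        for word in text.split():
            counts.update(char_ngrams(word, n))
        models[lang] = counts
    return models


def label_words(models, text, n=2):
    """Label each word with the language whose n-gram model scores it highest."""
    vocab_size = len(set().union(*models.values()))
    labels = []
    for word in text.split():
        best_lang, best_score = None, -math.inf
        for lang, counts in models.items():
            total = sum(counts.values())
            # Add-one smoothed log-likelihood of the word's n-grams.
            score = sum(
                math.log((counts[gram] + 1) / (total + vocab_size + 1))
                for gram in char_ngrams(word, n)
            )
            if score > best_score:
                best_lang, best_score = lang, score
        labels.append((word, best_lang))
    return labels


# Toy monolingual samples (illustrative only, not data from the paper).
samples = {
    "en": "the cat sat on the mat with the dog",
    "es": "el gato duerme en la casa con el perro",
}
models = train(samples)
print(label_words(models, "the dog duerme en casa"))
```

Because each word is scored in isolation, this baseline ignores the tendency of neighboring words to share a language, which is the kind of context a sequence model can exploit.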
