What to do about bad language on the internet
Jacob Eisenstein
The rise of social media has brought computational linguistics into ever-closer
contact with bad language: text that defies our expectations about vocabulary,
spelling, and syntax. This paper surveys the landscape of bad language and
offers a critical review of the NLP community’s response, which has largely
followed two paths: normalization and domain adaptation. Each approach is
evaluated in the context of theoretical and empirical work on computer-mediated
communication. In addition, the paper presents a quantitative analysis of the
lexical diversity of social media text, and its relationship to other corpora.