Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters
Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith
We consider the problem of part-of-speech tagging for informal, online
conversational text. We systematically evaluate the use of large-scale
unsupervised word clustering and new lexical features to improve tagging
accuracy. With these features, our system achieves state-of-the-art tagging
results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved
from 90% to 93% accuracy (more than 3% absolute). Qualitative analysis of
these word clusters yields insights about NLP and linguistic phenomena in this
genre. Additionally, we contribute the first POS annotation guidelines for
such text and release a new dataset of English language tweets annotated using
these guidelines.
Tagging software, annotation guidelines, and large-scale word clusters are
available at: http://www.ark.cs.cmu.edu/TweetNLP
This paper describes release 0.3 of the "CMU Twitter Part-of-Speech Tagger" and
annotated data.