Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter
Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson and David Yarowsky
Hidden properties of social media users, such as their ethnicity, gender, and
location, are often reflected in their observed attributes, such as their first
and last names. Furthermore, users who communicate with each other often have
similar hidden properties. We propose an algorithm that exploits these insights
to cluster the observed attributes of hundreds of millions of Twitter users.
Attributes such as user names are grouped together if users with those names
communicate with other similar users. We separately cluster millions of unique
first names, last names, and user-provided locations. The efficacy of these
clusters is then evaluated on a diverse set of classification tasks that
predict hidden user properties such as ethnicity, geographic location, gender,
language, and race, using only profile names and locations when appropriate.
Our readily replicable approach and publicly released clusters are shown to be
remarkably effective and versatile, substantially outperforming
state-of-the-art approaches and human accuracy on each of the tasks studied.
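
The grouping idea summarized above (names are clustered together when the users who bear them communicate with similar other users) can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify; it is a toy stand-in that assumes each name is represented by counts of the names it communicates with, and that names are greedily grouped by cosine similarity. The function names, the `threshold` parameter, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def name_profiles(edges, name_of):
    """Map each name to counts of the names its bearers communicate with."""
    profiles = defaultdict(Counter)
    for a, b in edges:
        na, nb = name_of[a], name_of[b]
        profiles[na][nb] += 1
        profiles[nb][na] += 1
    return profiles


def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def cluster_names(profiles, threshold=0.5):
    """Greedy single-pass grouping (an assumed, simplified clustering step).

    Each name joins the first cluster whose founding profile is at least
    `threshold` similar to its own; otherwise it founds a new cluster.
    """
    clusters = []  # list of (founding_profile, member_names)
    for name in sorted(profiles):
        for rep, members in clusters:
            if cosine(profiles[name], rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((profiles[name], [name]))
    return [members for _, members in clusters]


# Hypothetical toy data: user id -> first name, plus observed communications.
name_of = {"u1": "maria", "u2": "jose", "u3": "juan",
           "u4": "yuki", "u5": "hiro", "u6": "kenji"}
edges = [("u1", "u2"), ("u3", "u2"), ("u4", "u5"), ("u6", "u5")]

clusters = cluster_names(name_profiles(edges, name_of))
print(clusters)
```

In this toy input, "maria" and "juan" end up in one group because both communicate with "jose", and "kenji" and "yuki" end up in another because both communicate with "hiro". A greedy pass like this is order-dependent and would not scale to the hundreds of millions of users mentioned above; it is meant only to make the intuition concrete.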