Existing models for social media personal analytics assume access to thousands of messages per user, even though most users author content only sporadically over time. Given this sparsity, we: (i) leverage content from the local neighborhood of a user; (ii) evaluate batch models as a function of neighborhood size and the amount of messages in various types of neighborhoods; and (iii) estimate the amount of time and tweets required for a dynamic model to predict user preferences. We show that even when limited or no self-authored data is available, language from friend, retweet and user-mention communications provides sufficient evidence for prediction. When updating models over time based on Twitter, we find that political preference can often be predicted using roughly 100 tweets, depending on the context of user selection, where this could mean hours or weeks depending on the author's tweeting frequency.
Automatically inferring latent user attributes such as gender, age, and political preference [30, 42, 6] from personal communications and social media, including emails, blog posts and public discussions, has become increasingly popular as the web grows more social and the volume of available data increases. Resources like Twitter (see, e.g., http://www.demographicspro.com/) and Facebook (e.g., http://www.wolframalpha.com/facebook/) are extremely valuable for studying the underlying properties of such informal communications because of their volume, dynamic nature, and diverse population [18, 33].
The existing batch models for predicting latent user attributes rely on thousands of tweets per author [31, 7, 27, 5, 42, 21]. However, most Twitter users are less prolific than those examined in these works, and thus do not produce the thousands of tweets required to obtain those levels of accuracy (e.g., the median number of tweets produced by a random Twitter user per day is 10). Moreover, recent changes to Twitter API querying rates further restrict the speed of access to this resource, effectively reducing the amount of data that can be collected in a given time period.
In this paper we analyze and go beyond static models, formulating personal analytics in social media as a streaming task. We first evaluate batch models that are cognizant of the low-resource prediction setting described above, maximizing the efficiency of content used for calculating personal analytics. To the best of our knowledge, this is the first work that makes explicit the tradeoff between accuracy and cost (manifest as calls to the Twitter API), and optimizes to a different tradeoff than state-of-the-art approaches, seeking maximal performance when limited data is available. In addition, we propose streaming models for personal analytics that dynamically update user labels based on their stream of communications, building on the streaming framework previously explored by Van Durme (2012b). Such models better capture the real-time nature of evidence being used in latent author attribute prediction tasks. Our main contributions include:
develop low-resource and real-time dynamic approaches for personal analytics using as an example the prediction of political preference of Twitter users;
examine the relative utility of six different notions of “similarity” between users in an implicit Twitter social network for personal analytics;
perform experiments across multiple datasets supporting the prediction of political preference in Twitter, highlighting the significant differences in performance that arise from the underlying collection and annotation strategies.
Twitter users interact with one another and engage in direct communication in different ways, e.g., using retweets, user mentions (e.g., @youtube) or hashtags (e.g., #tcot), in addition to having explicit connections such as following and friending. To investigate all types of social relationships between Twitter users and to construct Twitter social graphs, we collect lists of followers and friends, and extract user mentions, hashtags, replies and retweets from communications. (The code and a detailed explanation of how we collected all six types of user neighbors and their communications using the Twitter API can be found at http://www.cs.jhu.edu/~svitlana/.)
Let us define an attributed, undirected graph $G = (V, E)$, where $V$ is a set of vertices and $E$ is a set of edges. Each vertex $v \in V$ represents someone in a communication graph, i.e., a communicant: here a Twitter user. Each vertex is attributed with a feature vector $\vec{f}(v)$, which encodes communications, e.g., the tweets available for a given user. Each vertex is associated with a latent attribute $a(v)$; in our case it is binary, $a(v) \in \{D, R\}$, where $D$ stands for Democratic and $R$ for Republican users. Each edge $e(v_i, v_j) \in E$ represents a connection between $v_i$ and $v_j$, and defines different social circles between Twitter users: follower ($fl$), friend ($fr$), user mention ($m$), hashtag ($h$), reply ($re$) and retweet ($rt$). Thus, $E = \{E_{fl}, E_{fr}, E_m, E_h, E_{re}, E_{rt}\}$. We denote a set of edges of a given type as $E_r$ for $r \in R = \{fl, fr, m, h, re, rt\}$. We denote the set of vertices adjacent to $v$ by social circle type $r$ as $N_r(v)$, which is equivalent to $\{v_j \mid e(v, v_j) \in E_r\}$. Following Filippova (2012) we refer to $N_r(v)$ as $v$'s social circle, otherwise known as a neighborhood. In most cases, we only work with a sample of a social circle, denoted by $\hat{N}_r(v)$, where $n = |\hat{N}_r(v)|$ is its size, for $r \in R$.
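This notation maps naturally onto a simple data structure. Below is a minimal Python sketch, not the authors' released code and with illustrative names, of an attributed graph with typed edge sets $E_r$ and sampled neighborhoods $\hat{N}_r(v)$:

```python
from collections import defaultdict

# Edge types r in R, as defined above.
EDGE_TYPES = {"follower", "friend", "mention", "hashtag", "reply", "retweet"}

class SocialGraph:
    def __init__(self):
        self.tweets = defaultdict(list)   # vertex -> list of tweets; f(v) is derived from these
        self.label = {}                   # vertex -> latent attribute a(v) in {"D", "R"}
        self.edges = {r: defaultdict(set) for r in EDGE_TYPES}  # E_r as adjacency sets

    def add_edge(self, r, u, v):
        assert r in EDGE_TYPES
        self.edges[r][u].add(v)
        self.edges[r][v].add(u)           # the graph is undirected

    def neighborhood(self, v, r):
        """N_r(v): vertices adjacent to v via edges of type r."""
        return self.edges[r][v]

    def sample_neighborhood(self, v, r, n, rng):
        """A random sample of v's type-r social circle, of size at most n."""
        circle = list(self.edges[r][v])
        rng.shuffle(circle)
        return circle[:n]
```

For example, `g.sample_neighborhood(v, "retweet", n=10, rng=random.Random(0))` would return a sampled retweet circle $\hat{N}_{rt}(v)$ of up to 10 neighbors.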
Figure 1 presents an example of a social graph derived from Twitter. Notably, users from different social circles can be shared across users of the same or different classes, e.g., a user can be in both the follower circle $N_{fl}(v)$ and the retweet circle $N_{rt}(v)$.
We construct the candidate-centric graph $G_{cand}$ by looking at following relationships between users and the Democratic or Republican candidates during the 2012 US Presidential election. In the Fall of 2012, leading up to the election, we randomly sampled Democratic and Republican users. We labeled users as Democratic if they exclusively follow both Democratic candidates – BarackObama and JoeBiden – but do not follow both Republican candidates – MittRomney and RepPaulRyan – and vice versa. (As of Oct 12, 2012, the numbers of followers for Obama, Biden, Romney and Ryan were 2M, 168K, 1.3M and 267K, respectively.) We collectively refer to both groups as our "users of interest" for whom we aim to predict political preference. For each such user we collect recent tweets and randomly sample their immediate neighbors from the follower, friend, user mention, reply, retweet and hashtag social circles.
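The exclusive-following rule can be stated compactly in code. The sketch below is our illustration of the labeling logic just described, not the authors' collection pipeline:

```python
# Candidate handles from the 2012 US Presidential election.
DEM_CANDIDATES = {"BarackObama", "JoeBiden"}
REP_CANDIDATES = {"MittRomney", "RepPaulRyan"}

def political_label(followed: set):
    """Return 'D'/'R' under the exclusive-following rule, else None."""
    follows_dems = DEM_CANDIDATES <= followed   # follows *both* Democratic candidates
    follows_reps = REP_CANDIDATES <= followed   # follows *both* Republican candidates
    if follows_dems and not follows_reps:
        return "D"
    if follows_reps and not follows_dems:
        return "R"
    return None  # ambiguous users are not sampled as users of interest

assert political_label({"BarackObama", "JoeBiden", "nytimes"}) == "D"
```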
We construct a geo-centric graph $G_{geo}$ by collecting Democratic and Republican users from the Maryland, Virginia and Delaware region of the US with self-reported political preference in their biographies. As with the candidate-centric graph, for each user we collect recent tweets and randomly sample user social circles in the Fall of 2012. We collect this data to obtain a sample of politically less active users compared to the users in the candidate-centric graph.
We also consider a graph $G_{ZLR}$ constructed from a dataset previously used for political affiliation classification [42]. This dataset consists of 200 Republican and 200 Democratic users associated with 925 tweets per user on average. (The original dataset was collected in 2012 and has been recently released at http://icwsm.cs.mcgill.ca/; political labels are extracted from http://www.wefollow.com as described by Pennacchiotti and Popescu (2011a).) Each user has on average 6,155 friends with 642 tweets per friend. Sharing restrictions and rate limits on Twitter data collection only allowed us to recreate a semblance of the ZLR data: 193 Democratic and 178 Republican users with 1K tweets per user, and 20 neighbors of four types (follower, friend, user mention and retweet) with 200 tweets per neighbor for each user of interest. (This inability to perfectly replicate prior work based on Twitter is a recognized problem throughout the community of computational social science, arising from the data policies of Twitter itself; it is not specific to this work.)
Baseline User Model As input we are given a set of vertices $V$ representing users of interest, along with feature vectors $\vec{f}(v)$ derived from content authored by each user of interest $v$. Each user is associated with a non-zero number of publicly posted tweets. Our goal is to assign each user of interest to a category based on $\vec{f}(v)$. Here we focus on a binary assignment into the categories Democratic ($D$) or Republican ($R$). We use log-linear models over reasonable alternatives such as perceptron or SVM, following the practice of a wide range of previous work in related areas [34, 17, 29], including text classification in social media [38, 39]. The log-linear model for such binary classification is:
$$p\big(a(v) = R \mid \vec{f}(v)\big) = \frac{\exp\big(\vec{\theta} \cdot \vec{f}(v)\big)}{1 + \exp\big(\vec{\theta} \cdot \vec{f}(v)\big)} \qquad (1)$$
where the features $\vec{f}(v)$ are normalized word ngram counts extracted from $v$'s tweets.
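Concretely, Eq. 1 is standard binary logistic regression over a bag of ngrams. A minimal sketch using scikit-learn follows; this is our illustration with toy tweets and labels, whereas the experiments in this paper use LibLinear via Jerboa:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# One document per user of interest: the concatenation of their tweets.
docs = [" ".join(user_tweets) for user_tweets in [
    ["vote obama 2012", "prochoice and proud"],    # a Democratic user
    ["romney ryan 2012", "repeal obamacare now"],  # a Republican user
]]
labels = ["D", "R"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),  # unigrams performed best (see the note in Section 5)
    Normalizer(norm="l1"),                # normalized ngram counts as f(v)
    LogisticRegression(),                 # the log-linear binary classifier of Eq. 1
)
model.fit(docs, labels)
print(model.predict_proba(["i will vote for obama"]))  # [p(D), p(R)] for a new user
```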
The proposed baseline model follows the same trends as the existing state-of-the-art approaches for user attribute classification in social media, as described in Section 8. Next we extend the baseline model by taking advantage of the language in user social circles, as described below.
Neighbor Model As input we are given a user-local neighborhood $\hat{N}_r(v)$, where $r$ is a neighborhood type. Besides its type $r$, each $\hat{N}_r(v)$ is characterized by:

the number of communications (tweets) per neighbor, $t$;

the order of the social circle, i.e., the number of neighbors per user of interest, $n$.
Our goal is to classify users of interest as Democratic or Republican using evidence (e.g., communications) from their local neighborhood $\hat{N}_r(v)$. The corresponding log-linear model is defined as:
$$p\big(a(v) = R \mid \hat{N}_r(v)\big) = \frac{\exp\big(\vec{\theta} \cdot \vec{f}(\hat{N}_r(v))\big)}{1 + \exp\big(\vec{\theta} \cdot \vec{f}(\hat{N}_r(v))\big)} \qquad (2)$$
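The only difference from the baseline model is the provenance of the text: features are extracted from neighbor tweets rather than the user's own. A sketch, using the illustrative SocialGraph structure from Section 2:

```python
def neighborhood_document(graph, v, r, n, t, rng):
    """Build the text behind f(N_r(v)): up to t tweets from each of n
    sampled type-r neighbors of v, concatenated into one document."""
    doc = []
    for u in graph.sample_neighborhood(v, r, n, rng):
        doc.extend(graph.tweets[u][:t])   # t tweets per neighbor
    return " ".join(doc)
```

The resulting document is fed to the same ngram pipeline as in the baseline model.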
To check whether our static models are cognizant of low-resource prediction settings, we compare the performance of the user model from Eq. 1 and the neighborhood model from Eq. 2. Following the streaming nature of social media, we treat the scarce resource as the number of requests allowed per day to the Twitter API. Here we abstract this to a model assumption where we receive one tweet at a time and aim to maximize classification performance with as few tweets per user as possible. (A separate issue is that many authors simply do not tweet very often; for instance, 85.3% of all Twitter users post less than one update per day, as reported at http://www.sysomos.com/insidetwitter/. Thus their communications are scarce even if we could get all of them without rate limiting from the Twitter API.) We define the resource cost as follows:
for the baseline user model:
$$\text{cost}_{\text{user}}(v) = t \qquad (3)$$
for the neighborhood model:
$$\text{cost}_{\text{nbr}}(v) = n \cdot t \qquad (4)$$
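Under these cost functions, a fixed API budget can be spent either on the user or spread over the neighborhood. A small sketch of our reading of Eqs. 3 and 4:

```python
def user_model_cost(t: int) -> int:
    return t        # tweets requested from the user of interest (Eq. 3)

def neighbor_model_cost(n: int, t: int) -> int:
    return n * t    # n neighbors, t tweets each (Eq. 4)

# The equal-budget point used in the experiments: 100 user tweets cost the
# same as 10 tweets from each of 10 neighbors.
assert user_model_cost(100) == neighbor_model_cost(10, 10)
```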
We rely on a straightforward Bayes-rule update to our batch models in order to simulate a real-time streaming prediction scenario, as a first step beyond the existing models, as shown in Figure 2.
The model makes predictions of a latent user attribute, e.g., Republican, under a model assumption of sequentially arriving, independent and identically distributed observations. (Given the dynamic character of online discourse, it will clearly be of interest in the future to consider models that go beyond the i.i.d. assumption.) The model dynamically updates posterior probability estimates for a given user as additional evidence is acquired, as defined in a general form below for any latent attribute $a$ given the tweets $t_1, \ldots, t_k$ of user $v$:
$$p(a \mid t_1, \ldots, t_k) = \frac{p(a) \prod_{i=1}^{k} p(t_i \mid a)}{\sum_{j=1}^{m} p(a_j) \prod_{i=1}^{k} p(t_i \mid a_j)} \qquad (5)$$
where $m$ is the number of all possible attribute values, and $k$ is the number of tweets per user.
For example, to predict user political preference, we start with a uniform prior $p(R) = p(D) = 0.5$, and sequentially update the posterior by accumulating evidence from the per-tweet likelihoods $p(t_i \mid R)$ and $p(t_i \mid D)$:
$$p(R \mid t_1, \ldots, t_k) = \frac{p(R) \prod_{i=1}^{k} p(t_i \mid R)}{p(R) \prod_{i=1}^{k} p(t_i \mid R) + p(D) \prod_{i=1}^{k} p(t_i \mid D)} \qquad (6)$$
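In code, the update in Eq. 6 reduces to a running sum of per-tweet log-likelihoods followed by normalization. A minimal sketch, where `loglik` is an assumed callable returning log-probabilities of a tweet under each class (e.g., backed by the trained classifier), kept in log space for numerical stability:

```python
import math

def posterior_stream(tweets, loglik, prior_R=0.5):
    """Yield p(R | t_1..t_k) after each arriving tweet."""
    log_R = math.log(prior_R)        # log p(R) plus accumulated log p(t_i | R)
    log_D = math.log(1.0 - prior_R)  # log p(D) plus accumulated log p(t_i | D)
    for tweet in tweets:
        log_R += loglik(tweet, "R")
        log_D += loglik(tweet, "D")
        # Normalize as in Eq. 6, shifting by the max to avoid underflow.
        m = max(log_R, log_D)
        yield math.exp(log_R - m) / (math.exp(log_R - m) + math.exp(log_D - m))
```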
Our goal is to maximize posterior probability estimates given a stream of communications for each user in the data over (a) time and (b) the number of tweets $k$. To that end, for each user we take tweets that arrive continuously over time and apply two different streaming models:
User Model with Dynamic Updates: relies exclusively on user tweets in the order they arrive over time; for each user we dynamically update the posterior $p(a \mid t_1, \ldots, t_k)$.
User-Neighbor Model with Dynamic Updates: relies on both neighbor communications (friend, follower, retweet and user mention) and user tweets in the order they arrive over time; here we dynamically update the posterior probability using the joint stream.
We design a set of experiments to analyze static and dynamic models for political affiliation classification defined in Sections 3 and 4.
We first answer whether communications from user-local neighborhoods can help predict political preference for the user. To explore the contribution of different neighborhood types, we learn static user and neighbor models on the $G_{cand}$, $G_{geo}$ and $G_{ZLR}$ graphs. We also examine the ability of our static models to predict user political preferences in a low-resource setting, e.g., with only 5 tweets.
The existing models follow a standard setup where either user or neighbor tweets are available during both training and testing. For the static neighbor model we go beyond that: we train the model on all data available per user, but only apply part of the data at test time, pushing the boundaries of how little is truly required for classification (for example, using only follower tweets at test time even though tweets from all neighbor types were available during training). Such a setup simulates real-world prediction scenarios which, to our knowledge, have not been previously explored, e.g., when a user has a private profile or has not tweeted yet, and only user neighbor tweets are available.
We experiment with our static neighbor model defined in Eq.2 with the aim to:
evaluate the influence of neighborhood size: we vary the number of neighbors $n$ per user (e.g., 2 vs. 10 neighbors);

estimate the influence of neighbor content: we vary the amount of content $t$ per neighbor (e.g., 5 vs. 200 tweets).
We perform 10-fold cross-validation (for each fold we split the data into 3 parts: 70% train, 10% development and 20% test) and run 100 random restarts for every $n$ and $t$ parameter combination. We compare our static neighbor and user models using the cost functions from Eq. 3 and Eq. 4. For all experiments we use LibLinear [9], integrated in the Jerboa toolkit [37]. Both models defined in Eq. 1 and Eq. 2 are learned using normalized count-based word ngram features extracted from either user or neighbor tweets. (For brevity we omit results for bigram and trigram features, since unigrams showed superior performance.)
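A sketch of this evaluation protocol, substituting scikit-learn's LogisticRegression for LibLinear/Jerboa; `X` and `y` are assumed to be a precomputed feature matrix and label array:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

def evaluate(X, y, restarts=100, seed=0):
    """Mean test accuracy over repeated random 70/10/20 train/dev/test splits."""
    accs = []
    splitter = ShuffleSplit(n_splits=restarts, test_size=0.3, random_state=seed)
    for train_idx, rest_idx in splitter.split(X):
        # Carve the held-out 30% into 10% dev and 20% test.
        dev_idx = rest_idx[: len(rest_idx) // 3]
        test_idx = rest_idx[len(rest_idx) // 3 :]
        clf = LogisticRegression().fit(X[train_idx], y[train_idx])
        # The dev split would be used for hyperparameter tuning; omitted here.
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))
```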
We evaluate our models with dynamic Bayesian updates on a continuous stream of communications over time, as shown in Figure 2. Unlike the static model experiments, we do not model the influence of the number of neighbors or the amount of content per neighbor. Instead, we order user and neighbor communication streams by the real-world time of posting and measure changes in posterior probabilities over time. The main purpose of these experiments is to quantitatively evaluate (1) the number of tweets and (2) the amount of real-world time it takes to observe enough evidence on Twitter to make reliable predictions.
We experiment with the log-linear models defined in Eq. 1 and 2 and continuously estimate the posterior probabilities as defined in Eq. 6. We average the posterior probability results over the users in the $G_{cand}$, $G_{geo}$ and $G_{ZLR}$ graphs. We train streaming models on an attribute-balanced subset of tweets for each user $v$, excluding $v$'s tweets (or $v$'s neighbor tweets for the joint model). This setup is similar to leave-one-out classification. The classifier is learned using binary word ngram features extracted from user or user-neighbor communications. We prefer binary to normalized count-based features to overcome sparsity issues caused by making predictions on each tweet individually.
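For instance, with scikit-learn the switch from normalized counts to binary indicators is a single flag (an illustration, not the authors' code):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Binary presence features, as preferred for per-tweet streaming predictions:
# a single tweet is too short for stable normalized counts.
vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True)
X = vectorizer.fit_transform(["four more years", "believe in america"])
print(X.toarray())  # 0/1 indicators rather than counts
```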
We investigate classification decision probabilities for our static user model by making predictions on a random set of 5 vs. 100 tweets per user. To our knowledge, only limited work on personal analytics [5, 38] has performed this straightforward comparison. For that purpose, we take a random partition containing 100 users and perform four independent classification experiments: two runs using 5 and two runs using 100 tweets per user.
Figure 3 demonstrates that more tweets at prediction time lead to higher accuracy: more users with 100 tweets are correctly classified, e.g., filled green markers in the upper right quadrant are true Republicans and those in the lower left quadrant are true Democrats. Moreover, many users with 100 tweets are close to a 0.5 decision probability, which suggests that the classifier is merely uncertain rather than completely off; e.g., misclassified Republican users with 5 tweets (unfilled blue markers in the lower right quadrant) are close to 0. These results follow naturally from the underlying feature representation: having more tweets per user leads to a lower-variance estimate of a target multinomial distribution. The more robustly this distribution is estimated (based on having more tweets), the more confident we should be in the classifier output.
[Figure 4: accuracy of the static neighbor model as a function of the number of tweets per neighbor; panels show results with 2 and 10 neighbors per user for each of two graphs.]
Here we discuss the results for our static neighborhood model. We study the influence of the neighborhood type and size in terms of the number of neighbors and tweets per neighbor.
In Figure 4 we present accuracy results for the $G_{cand}$ and $G_{geo}$ graphs. Following Eq. 3 and 4, we spend an equal amount of resources to obtain 100 user tweets and 10 tweets from each of 10 neighbors. We annotate these "points of equal number of communications" with a line on top, marked with the corresponding number of user tweets.
[Figure 5: accuracy of the static neighbor model as a function of the number of neighbors per user; panels show results with 5 and 200 tweets per neighbor for each of two graphs.]
We show that three of the six social circles – friend, retweet and user mention – yield better accuracy than the user model for all graphs when enough tweets per neighbor are available (e.g., $t = 200$). Thus, for effectively classifying a given user it is better to take 200 tweets each from 10 neighbors than 2,000 tweets from the user.
The best accuracy for $G_{cand}$ is 0.75, obtained with the friend, follower, retweet and user-mention neighborhoods, which is 0.03 higher than the user baseline; for $G_{geo}$ it is 0.67 for user-mention and 0.64 for retweet circles, compared to 0.57 for the user model; for $G_{ZLR}$ it is 0.863 for retweet and 0.849 for friend circles, which is 0.11 higher than the user baseline. Finally, similarly to the results for the user model given in Figure 3, increasing the number of tweets per neighbor from 5 to 200 leads to a significant gain in performance for all neighborhood types.
In Figure 5 we present accuracy results showing the influence of neighborhood size on classification performance for the $G_{cand}$ and $G_{geo}$ graphs. Our results demonstrate that even small changes to the neighborhood size lead to better performance, which does not support the claims by Zamal et al. (2012). We demonstrate that increasing the size of the neighborhood leads to better performance across all six neighborhood types. Friend, user mention and retweet neighborhoods yield the highest accuracy for all graphs. We observe that when the number of neighbors is small, the difference in accuracy across neighborhood types is less significant, but for larger neighborhoods (e.g., $n = 10$) it becomes more significant.
Figure 6 demonstrates dynamic user model prediction results averaged over users from the $G_{cand}$ and $G_{geo}$ graphs. Each panel outlines changes in sequential average probability estimates computed for each individual self-authored tweet as defined in Eq. 6. The average probability estimates are reported for every 5 tweets in a stream as $\bar{p}_k = \frac{1}{|V_a|} \sum_{v \in V_a} p(a \mid t_1, \ldots, t_k)$, where $|V_a|$ is the total number of users with the same attribute, $a = D$ or $R$. We represent $\bar{p}_k$ as a box-and-whisker plot with the median, lower and upper quartiles to show the variance; the lengths of the whiskers indicate lower and upper extreme values.
[Figure 6: average posterior probability estimates over the user tweet stream; panels (a) User $G_{cand}$ and (b) User $G_{geo}$.]
We find similar behavior across all three graphs. In particular, the posterior estimates converge faster when predicting Democratic than Republican users, even though the model has been trained on an equal number of tweets per class. We observe that average posterior estimates converge faster to 0 (Democratic) than to 1 (Republican) in Figures 6a and 6b. This suggests that the language of Democrats is more expressive of their political preference than the language of Republicans. For example, frequent politically influenced terms used widely by Democratic users include faith4liberty, constitutionally, pass, vote2012, terroristic.
The variance of the average posterior estimates decreases as the number of tweets increases for all three datasets. Moreover, we detect that estimates for users in $G_{cand}$ converge 2-3 times faster in terms of the number of tweets than for users in $G_{ZLR}$. The slowest convergence is detected for $G_{geo}$, where even after 250 tweets the average posterior estimates have not fully converged to either class. This means that users in $G_{cand}$ are more politically vocal compared to users in $G_{geo}$ and $G_{ZLR}$. As a result, less active users in $G_{geo}$ simply need more than 250 tweets to converge to a true 0 or 1 class. These results are consistent with the outcomes for our static models shown in Figures 4 and 5. These findings further confirm that differences in performance are caused by various biases present in the data due to distinct sampling and annotation approaches.
[Figure 7: the amount of time required to infer political preference at different accuracy levels; panels (a) User $G_{cand}$, (b) User $G_{ZLR}$, (c) User-Neighbor $G_{cand}$, (d) User-Neighbor $G_{ZLR}$.]
Figures 7a and 7b illustrate the amount of time required for the user model to infer political preferences, estimated for 1,031 users in $G_{cand}$ and 371 users in $G_{ZLR}$. The amount of time needed can be evaluated for different accuracy levels, e.g., 0.75 and 0.95. Thus, with 75% accuracy we classify (a sketch of this computation follows the list below):
100 (20%) Republican users in 3.6 hours and the same number of Democratic users in 2.2 hours for $G_{cand}$;
100 (56%) Republican users in 20 weeks and 100 (52%) Democratic users in 8.9 weeks for $G_{ZLR}$, which is roughly 800 times longer than for $G_{cand}$;
100 (75%) users in 12 weeks and 80 (60%) users in 19 weeks for $G_{geo}$.
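A hedged sketch of one way such time-to-classification numbers can be derived, assuming each user's posterior trajectory from Eq. 6 is available; `stream` holds (timestamp, tweet) pairs and all names are illustrative:

```python
def time_to_confident_call(stream, posteriors, true_class, threshold=0.75):
    """Elapsed time until the posterior first favors the true class
    at the given confidence level, or None if it never does."""
    start = stream[0][0]
    for (ts, _), p_R in zip(stream, posteriors):
        p_true = p_R if true_class == "R" else 1.0 - p_R
        if p_true >= threshold:
            return ts - start   # a confident, correct call
    return None

def time_until_k_users(per_user_times, k):
    """Time until k users have been confidently classified (e.g., k = 100)."""
    reached = sorted(t for t in per_user_times if t is not None)
    return reached[k - 1] if len(reached) >= k else None
```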
Such extreme divergences in the amount of time required for classification across the graphs should be of strong interest to researchers concerned with latent attribute prediction tasks, because Twitter users produce messages at extremely different frequencies. In our case, users in $G_{ZLR}$ tweet approximately 800 times less frequently than users in $G_{cand}$.
We estimate dynamic posterior updates from a joint stream of user and neighbor communications in the $G_{cand}$, $G_{geo}$ and $G_{ZLR}$ graphs. To make a fair comparison with the streaming user model, we start with the same user tweet $t_1$. Then, instead of waiting for the next user tweet, we rely on any neighbor tweets that appear until the user produces the next tweet $t_2$. We rely on communications from four types of neighbors: friends, followers, retweets and user mentions.
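A sketch of how such a joint stream can be assembled, assuming tweets are (timestamp, text, source) tuples already sorted by time (our illustration, not the authors' pipeline):

```python
import heapq

def joint_stream(user_tweets, neighbor_tweets):
    """Merge the two time-ordered streams, dropping neighbor tweets that
    precede the user's first tweet t_1."""
    first_time = user_tweets[0][0]
    merged = heapq.merge(user_tweets, neighbor_tweets)  # both sorted by timestamp
    return [tw for tw in merged if tw[0] >= first_time]
```

The merged stream is then fed, tweet by tweet, to the same Bayesian update as in Eq. 6.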
The convergence rate of the average posterior probability estimates with respect to the number of tweets is similar to the user model results presented in Figure 6. However, for $G_{cand}$ the variance is higher for Democratic users; for $G_{geo}$ the posterior converges for Republicans in less than 110 tweets, which is faster than the user model; and for $G_{ZLR}$ the convergence for both classes is not significantly different from the user model.
Figures 7c and 7d show the amount of time required for the joint user-neighbor model to infer political preferences, estimated for users in $G_{cand}$ and $G_{ZLR}$. We find that with 75% accuracy we can classify 100 users for:
$G_{cand}$: Republican users in 23 minutes and Democratic users in 10 minutes;
$G_{ZLR}$: Republican users in 3.2 weeks and Democratic users in 1.1 weeks, which is 7 times faster on average across attributes than for the user model;
$G_{geo}$: users in 1.2 weeks and 3.5 weeks, which is on average 6 times faster across attributes than for the user model.
Similar or better convergence in terms of the number of tweets and, especially, in the amount of time needed for the user and user-neighbor models further confirms that neighborhood content is useful for political preference prediction. Moreover, communications from a joint stream allow us to make inferences up to 7 times faster.
Supervised Batch Approaches The vast majority of work on predicting latent user attributes in social media applies supervised static models, e.g., SVMs for discrete categorical attributes such as gender and regression models for continuous attributes such as age, with lexical bag-of-words features, for classifying user gender [11, 31, 5, 38], age [31, 22, 21] or political orientation. We present an overview of the existing models for political preference prediction in Table 1.
Bergsma et al. [2], following up on Rao's work [31] on adding socio-linguistic features to improve gender, ethnicity and political preference prediction, show that incorporating stylistic and syntactic information into the bag-of-words features improves gender classification.
Table 1: Overview of existing models for political preference prediction. Where a row lists multiple feature sets, accuracies are given in the same order.

Approach | Users | Tweets | Features | Accuracy
Rao et al. (2010) | 1K | 2M | ngrams; socio-ling; stacked | 0.824; 0.634; 0.809
Pennacchiotti and Popescu (2011b) | 10.3K | – | ling-all; soc-all; full | 0.770; 0.863; 0.889
Conover et al. (2011) | 1,000 | 1M | full-text; hashtags; clusters | 0.792; 0.908; 0.949
Zamal et al. (2012) | 400 | 400K; 3.85M; 4.25M | UserOnly; Nbr; User-Nbr | 0.890; 0.920; 0.932
Cohen and Ruths (2013) | 397; 1.8K; 262; 196 | 397K; 1.8M; 262K; 196K | features from [42] | 0.910; 0.840; 0.680; 0.870
This paper, batch ($G_{cand}$) | 1,031 | 206K; 2M | user ngrams; neighbor ngrams | 0.720; 0.750
This paper, batch ($G_{geo}$) | 270 | 54K; 540K | user ngrams; neighbor ngrams | 0.570; 0.670
This paper, batch ($G_{ZLR}$) | 371 | 371K; 1.5M | user ngrams; neighbor ngrams | 0.886; 0.920
This paper, dynamic Bayesian updates ($G_{cand}$) | 1,031 | 103K; 130K | user stream; user-neighbor stream | 0.995; 0.999
This paper, dynamic Bayesian updates ($G_{geo}$) | 270 | 54K; 67K | user stream; user-neighbor stream | 0.843; 0.882
This paper, dynamic Bayesian updates ($G_{ZLR}$) | 371 | 74K; 185K | user stream; user-neighbor stream | 0.892; 0.999
Other methods characterize Twitter users by applying limited amounts of network structure information in addition to lexical features. Conover et al. [7] rely on identifying strong partisan clusters of Democratic and Republican users in a Twitter network based on retweet and user mention degree of connectivity, and then combine this clustering information with follower and friend neighborhood size features. Pennacchiotti et al. [27, 26] focus on user behavior, network structure and linguistic features. Similar to our work, they assume that users from a particular class tend to reply to and retweet messages of users from the same class. We extend this assumption and study other relationship types, e.g., friends, user mentions, etc. Recent work by Wong et al. [20] investigates tweeting and retweeting behavior for quantifying political leaning during the 2012 US Presidential election. The work most similar to ours is by Zamal et al. (2012), where the authors apply features from the tweets authored by a user's friends to infer attributes of that user. In this paper, we study different types of user social circles in addition to the friend network.
Additionally, using social media for mining political opinions [23, 19] or understanding socio-political trends and voting outcomes [36, 12, 15] is becoming common practice. For instance, Lampos et al. [15] propose a bilinear user-centric model for predicting voting intentions in the UK and Australia from social media data. Other works explore political blogs to predict which content will get the most comments [41], or analyze communications from Capitol Hill (http://www.tweetcongress.org) to predict campaign contributors based on this content [40].
Unsupervised Batch Approaches Bergsma et al. [1] show that large-scale clustering of user names improves gender, ethnicity and location classification on Twitter. O'Connor et al. [24], following the work by Eisenstein [8], propose a Bayesian generative model to discover demographic language variations in Twitter. Rao et al. [30] suggest a hierarchical Bayesian model which takes advantage of user name morphology for predicting user gender and ethnicity. Golbeck et al. [13] incorporate Twitter data in a spatial model of political ideology.
Streaming Approaches Van Durme (2012b) proposed streaming models to predict user gender on Twitter. Other work has applied text-stream processing to a variety of NLP tasks, e.g., real-time opinion mining and sentiment analysis in social media [25], named entity disambiguation [32], statistical machine translation [16], first story detection [28], and unsupervised dependency parsing [14]. The Massive Online Analysis (MOA) toolkit developed by Bifet et al. (2010) is an alternative to the Jerboa package (Van Durme, 2012a) used in this work; MOA has been effectively used to detect sentiment changes in Twitter streams [4].
In this paper, we extensively examined state-of-the-art static approaches and proposed novel models with dynamic Bayesian updates for streaming personal analytics on Twitter. Because our streaming models rely on communications from Twitter users and on content from various notions of a user-local neighborhood, they can be effectively applied to real-time dynamic data streams. Our results support several key findings, listed below.
Neighborhood content is useful for personal analytics. Content extracted from various notions of a user-local neighborhood can be as effective or more effective for political preference classification than user self-authored content. This may be an effect of ‘sparseness’ of relevant user data, in that users talk about politics very sporadically compared to a random sample of their neighbors.
Substantial signal for political preference prediction is distributed in the neighborhood. Querying for more neighbors per user is more beneficial than querying for extra content from the existing neighbors, e.g., 5 tweets from 10 neighbors leads to higher accuracy than 25 tweets from 2 neighbors or 50 tweets from 1 neighbor. This may also be an effect of data heterogeneity in social media compared to, e.g., political debate text [35]. These findings demonstrate that a substantial signal is distributed over the neighborhood content.
Neighborhoods constructed from friend, user mention and retweet relationships are most effective. Friend, user mention and retweet neighborhoods show the best accuracy for predicting the political preferences of Twitter users. We think that friend relationships are more effective than, e.g., follower relationships because it is very likely that users share common interests and preferences with their friends; e.g., Facebook friends can even be used to predict a user's credit score (http://money.cnn.com/2013/08/26/technology/social/facebook-credit-score/). User mentions and retweets are two primary ways of interacting on Twitter. They both allow users to share information, e.g., political news and events, and to engage in direct communication, e.g., live political discussions and political groups.
Streaming models are more effective than batch models for personal analytics. The predictions made using dynamic models with Bayesian updates over user and joint user-neighbor communication streams demonstrate higher performance with lower resources spent compared to the batch models. Depending on user political involvement, expressiveness and activeness, near-perfect predictions (approaching 100% accuracy) can be made using only 100-500 tweets per user.
Generalization of the classifiers for political preference prediction. This work raises a very important but under-explored problem: the generalization of classifiers for personal analytics in social media, also recently discussed by Cohen and Ruths [6]. For instance, the existing models developed for political preference prediction are all trained on Twitter data but report significantly different results, even for the same baseline models trained using bag-of-words lexical features, as shown in Table 1. In this work we experiment with three different datasets. Our results for both static and dynamic models show that accuracy indeed depends on the way the data was constructed. Therefore, publicly available datasets need to be released for a meaningful comparison of approaches to personal analytics in social media.
In future work, we plan to incorporate iterative model updates from newly classified communications, similar to online perceptron-style updates. In addition, we aim to experiment with neighborhood-specific classifiers applied to tweets from neighborhood-specific streams, e.g., a friend classifier used for friend tweets, a retweet classifier applied to retweet tweets, etc.
The authors would like to thank the anonymous reviewers for their helpful comments.