With the proliferation of social media sites, social streams have proven to contain the most up-to-date information on current events. It is therefore crucial to extract events from social streams such as tweets. However, it is not straightforward to adapt existing event extraction systems, since texts in social media are fragmented and noisy. In this paper we propose a simple yet effective Bayesian model, called the Latent Event Model (LEM), to extract structured representations of events from social media. LEM is fully unsupervised and does not require annotated data for training. We evaluate LEM on a Twitter corpus. Experimental results show that the proposed model achieves 83% in F-measure and outperforms the state-of-the-art baseline by over 7%.
Event extraction aims to automatically identify events in text, with information about what happened, when, where, to whom, and why. Previous work on event extraction has focused largely on news articles, as newswire texts have been the best source of information on current events [6]. Approaches to event extraction include knowledge-based [12, 15], data-driven [11], and combinations of the two [5]. Knowledge-based approaches often rely on linguistic and lexicographic patterns which encode expert domain knowledge for particular event types. They lack the flexibility to port to new domains, since extraction patterns often need to be re-defined. Data-driven approaches require large amounts of annotated data to train statistical models that approximate linguistic phenomena. Nevertheless, annotated data is expensive to obtain in practice.
With the increasing popularity of social media, social networking sites such as Twitter have become an important source of event information. As reported in [10], even 1% of the public Twitter stream contains around 95% of all the events reported in newswire. Nevertheless, social stream data such as Twitter data pose new challenges. Social media messages are often short and evolve rapidly over time. As such, it is not possible to know the event types a priori, which precludes the use of existing event extraction approaches.
Existing approaches to event extraction from Twitter include a graphical model that extracts canonical entertainment events from tweets by aggregating information across multiple messages [1]. In [7], social events involving two persons are extracted from multiple similar tweets using a factor graph, by harvesting the redundancy in tweets. Ritter et al. [14] presented a system called TwiCal which extracts an open-domain calendar of significant events from Twitter, each represented as a 4-tuple consisting of a named entity, an event phrase, a calendar date, and an event type.
In our work, we exploit an important property of social media data: the same event is often referenced by a high volume of messages. This property allows us to resort to statistical models that can group similar events based on the co-occurrence patterns of their event elements. Here, event elements include named entities such as person, company, and organization, as well as date/time, location, and the relations among them. We treat an event as a latent variable and model the generation of an event as a joint distribution over its individual event elements. We thus propose the Latent Event Model (LEM), which can automatically detect events from social media without the use of labeled data.
Our work is similar to TwiCal in the sense that we also focus on the extraction of structured representations of events from Twitter. However, TwiCal relies on a supervised sequence labeler trained on tweets annotated with event mentions to identify event-related phrases. We propose a simple Bayesian modelling approach which is able to directly extract event-related keywords from tweets without supervised learning. Also, TwiCal uses a statistical association test to choose the entity with the strongest association with a date, forming a binary tuple to represent an event. In contrast, the structured representation of events can be directly extracted from the output of our LEM model. We have conducted experiments on a Twitter corpus, and the results show that our proposed approach outperforms TwiCal, the state-of-the-art open event extraction system, by 7.7% in F-measure.
Events extracted by our proposed framework are represented as a 4-tuple $\langle y, d, l, k \rangle$, where $y$ stands for a non-location named entity, $d$ for a date, $l$ for a location, and $k$ for an event-related keyword. Each event mentioned in tweets can be closely depicted by this representation. It should be noted that for some events, one or more elements of their corresponding tuples might be absent, as the related information is not available in the tweets. As illustrated in Figure 1, our proposed framework consists of three main steps: pre-processing, event extraction based on the LEM model, and post-processing. The details of our proposed framework are described below.
Tweets are pre-processed by time expression recognition, named entity recognition, POS tagging and stemming.
Twitter users might express the same date in various forms. For example, "tomorrow", "next Monday", and "August 23rd" in tweets might all refer to the same day, depending on the date on which the tweets were written. To resolve the ambiguity of such time expressions, SUTime (http://nlp.stanford.edu/software/sutime.shtml) [2] is employed, which takes text and a reference date as input and outputs the absolute date to which the time expression refers.
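The resolution step can be pictured with a minimal stand-in for SUTime: given a relative expression and the tweet's publication date as the reference, return an absolute date. This is an illustrative Python sketch of the input/output contract only, not SUTime's actual API, and it handles only a handful of expressions.

```python
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Minimal stand-in for SUTime-style normalisation: map a relative time
# expression plus a reference date (the tweet's publication date) to an
# absolute date. SUTime handles far richer expressions than this sketch.
def resolve_time_expression(expr: str, reference: date) -> Optional[date]:
    expr = expr.strip().lower()
    if expr == "today":
        return reference
    if expr == "tomorrow":
        return reference + timedelta(days=1)
    if expr == "yesterday":
        return reference - timedelta(days=1)
    if expr.startswith("next ") and expr.split()[1] in WEEKDAYS:
        # Days until the next occurrence of the target weekday.
        target = WEEKDAYS.index(expr.split()[1])
        return reference + timedelta(days=(target - reference.weekday() - 1) % 7 + 1)
    return None  # unrecognised expression

# A tweet published on Thursday 18 August 2011 mentioning "next Monday":
print(resolve_time_expression("next Monday", date(2011, 8, 18)))  # 2011-08-22
```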
Named entity recognition (NER) is a crucial step, since its results directly impact the final extracted 4-tuple $\langle y, d, l, k \rangle$. It is not easy to accurately identify named entities in Twitter data, since tweets contain many misspellings and abbreviations. However, it is often observed that events mentioned in tweets are also reported in news articles during the same period [10]. Therefore, named entities mentioned in tweets are likely to appear in news articles as well. We thus perform named entity recognition in the following way. First, a traditional NER tool such as the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) is used to identify named entities in news articles crawled from BBC and CNN during the same period in which the tweets were published. The recognised named entities from news are then used to build a dictionary, and named entities in tweets are extracted by looking them up in the dictionary through fuzzy matching. We have also used a named entity tagger trained specifically on Twitter data (http://github.com/aritter/twitter-nlp) [13] to directly extract named entities from tweets. However, as will be shown in Section 3, using our constructed dictionary for named entity extraction gives better results. We distinguish between location entities, denoted as $l$, and non-location entities such as person or organization, denoted as $y$.
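The dictionary lookup can be sketched as follows, using Python's difflib for fuzzy string matching. The similarity threshold and the dictionary contents are illustrative assumptions, not values taken from the paper.

```python
import difflib

# Entities recognised in contemporaneous news articles form the dictionary;
# tweet phrases are matched against it with a fuzzy similarity threshold to
# tolerate misspellings. The 0.85 cutoff is illustrative only.
news_entities = {"barack obama": "PERSON", "new york": "LOCATION"}

def fuzzy_lookup(phrase, dictionary, threshold=0.85):
    phrase = phrase.lower()
    best, best_score = None, 0.0
    for entity, etype in dictionary.items():
        score = difflib.SequenceMatcher(None, phrase, entity).ratio()
        if score >= threshold and score > best_score:
            best, best_score = (entity, etype), score
    return best

print(fuzzy_lookup("barak obama", news_entities))   # ('barack obama', 'PERSON')
print(fuzzy_lookup("random words", news_entities))  # None
```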
Finally, we use a POS tagger (http://www.ark.cs.cmu.edu/TweetNLP) trained on tweets [3] to perform POS tagging, and apart from the previously recognised named entities, only words tagged as nouns, verbs, or adjectives are kept. These remaining words are subsequently stemmed, and words occurring fewer than 3 times are filtered out.
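A minimal sketch of this filtering step is given below, assuming tokens have already been POS-tagged (the hard-coded tags stand in for the Twitter tagger's output) and using NLTK's Porter stemmer.

```python
from collections import Counter
from nltk.stem import PorterStemmer  # pip install nltk

# Keep only nouns (N), verbs (V) and adjectives (A), stem the survivors,
# and drop stems occurring fewer than 3 times across the corpus.
tagged_tweets = [
    [("earthquake", "N"), ("hits", "V"), ("the", "D"), ("city", "N")],
    [("earthquakes", "N"), ("reported", "V"), ("near", "P"), ("city", "N")],
    [("huge", "A"), ("earthquake", "N"), ("in", "P"), ("city", "N")],
]

stemmer = PorterStemmer()
stems = [stemmer.stem(w) for tweet in tagged_tweets
         for w, tag in tweet if tag in {"N", "V", "A"}]
counts = Counter(stems)
keywords = [s for s in counts if counts[s] >= 3]
print(keywords)  # ['earthquak', 'citi']
```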
After the pre-processing step, the non-location entities $y$, locations $l$, dates $d$, and candidate keywords $k$ of the tweets are collected as the input to the LEM model for event extraction.
We propose an unsupervised latent variable model, called the Latent Event Model (LEM), to extract events from tweets. The graphical model of LEM is shown in Figure 2.
In this model, we assume that each tweet message $m$ is assigned to one event instance $e$, where $e$ is modeled as a joint distribution over the named entities $y$, the date/time $d$ when the event occurred, the location $l$ where the event occurred, and the event-related keywords $k$. This assumption essentially encourages tweets that involve the same named entities, occur at the same time and in the same location, and have similar keywords to be assigned to the same event.
The generative process of LEM is shown below; a small simulation sketch follows the list.
- Draw the event distribution $\pi \sim \text{Dirichlet}(\alpha)$.
- For each event $e \in \{1, \dots, E\}$, draw multinomial distributions $\theta_e$ (over entities), $\phi_e$ (over dates), $\psi_e$ (over locations), and $\omega_e$ (over keywords), each from a $\text{Dirichlet}(\beta)$ prior.
- For each tweet $m$:
  - Choose an event $e_m \sim \text{Multinomial}(\pi)$.
  - For each named entity occurring in tweet $m$, choose an entity $y \sim \text{Multinomial}(\theta_{e_m})$.
  - For each date occurring in tweet $m$, choose a date $d \sim \text{Multinomial}(\phi_{e_m})$.
  - For each location occurring in tweet $m$, choose a location $l \sim \text{Multinomial}(\psi_{e_m})$.
  - For the other words in tweet $m$, choose a word $k \sim \text{Multinomial}(\omega_{e_m})$.
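The following sketch simulates this generative story with numpy. The symbols $\pi, \theta, \phi, \psi, \omega$ follow the reconstruction above; the hyperparameter values and vocabulary sizes are illustrative assumptions.

```python
import numpy as np

# A minimal simulation of LEM's generative story (a sketch; the sizes and
# hyperparameter values below are illustrative, not those of the paper).
rng = np.random.default_rng(0)
E, Y, D, L, K = 5, 20, 10, 8, 30   # events; entity/date/location/keyword vocab sizes
alpha, beta = 1.0, 0.1

pi    = rng.dirichlet(alpha * np.ones(E))          # event distribution
theta = rng.dirichlet(beta * np.ones(Y), size=E)   # per-event entity dists
phi   = rng.dirichlet(beta * np.ones(D), size=E)   # per-event date dists
psi   = rng.dirichlet(beta * np.ones(L), size=E)   # per-event location dists
omega = rng.dirichlet(beta * np.ones(K), size=E)   # per-event keyword dists

def generate_tweet(n_y=1, n_d=1, n_l=1, n_k=3):
    e = rng.choice(E, p=pi)                        # event for this tweet
    return {
        "event": e,
        "entities":  rng.choice(Y, size=n_y, p=theta[e]).tolist(),
        "dates":     rng.choice(D, size=n_d, p=phi[e]).tolist(),
        "locations": rng.choice(L, size=n_l, p=psi[e]).tolist(),
        "keywords":  rng.choice(K, size=n_k, p=omega[e]).tolist(),
    }

print(generate_tweet())
```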
We use collapsed Gibbs sampling [4] to infer the parameters of the model and the latent event assignments, given the observed data. Gibbs sampling allows us to repeatedly draw samples from a Markov chain whose stationary distribution is the posterior of interest, by sampling each variable from its conditional distribution given the current values of all other variables and the data. Such samples can be used to empirically estimate the target distribution. Letting the subscript $-m$ denote a quantity that excludes the data from the $m$th tweet, the conditional posterior for $e_m$ is:
$$P(e_m = e \mid \mathbf{e}_{-m}, \mathbf{y}, \mathbf{d}, \mathbf{l}, \mathbf{k}) \propto \frac{n^{e}_{-m} + \alpha}{M - 1 + E\alpha} \prod_{y \in \mathbf{y}_m} \frac{n^{e,y}_{-m} + \beta}{\sum_{y'} n^{e,y'}_{-m} + Y\beta} \prod_{d \in \mathbf{d}_m} \frac{n^{e,d}_{-m} + \beta}{\sum_{d'} n^{e,d'}_{-m} + D\beta} \prod_{l \in \mathbf{l}_m} \frac{n^{e,l}_{-m} + \beta}{\sum_{l'} n^{e,l'}_{-m} + L\beta} \prod_{k \in \mathbf{k}_m} \frac{n^{e,k}_{-m} + \beta}{\sum_{k'} n^{e,k'}_{-m} + K\beta} \qquad (1)$$
where $n^{e}_{-m}$ is the number of tweets that have been assigned to event $e$; $M$ is the total number of tweets; $n^{e,y}_{-m}$ is the number of times named entity $y$ has been associated with event $e$; $n^{e,d}_{-m}$ is the number of times date $d$ has been associated with event $e$; $n^{e,l}_{-m}$ is the number of times location $l$ has been associated with event $e$; $n^{e,k}_{-m}$ is the number of times keyword $k$ has been associated with event $e$; counts with the subscript $-m$ exclude the counts relating to tweet $m$. $Y$, $D$, $L$, and $K$ are the total numbers of distinct named entities, dates, locations, and words appearing in the whole Twitter corpus, respectively. $E$ is the total number of events, which needs to be set in advance.
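A sketch of one collapsed Gibbs update implementing Eq. (1) is given below. The count arrays are assumed to already exclude tweet $m$ (the $-m$ counts), a single symmetric prior $\beta$ is used for all element distributions as in the equation above, and the variable names are ours.

```python
import numpy as np

# One collapsed Gibbs update for the event assignment e_m of a single tweet.
# n_e[e] counts tweets assigned to event e; n_ey, n_ed, n_el, n_ek are
# (E x vocab) count matrices of entity/date/location/keyword occurrences
# per event, all with tweet m's contributions already subtracted.
def sample_event(tweet, n_e, n_ey, n_ed, n_el, n_ek, alpha, beta, M, rng):
    E = len(n_e)
    Y, D, L, K = n_ey.shape[1], n_ed.shape[1], n_el.shape[1], n_ek.shape[1]
    # Event prior term, computed in log space for numerical stability.
    logp = np.log(n_e + alpha) - np.log(M - 1 + E * alpha)
    for y in tweet["entities"]:
        logp += np.log(n_ey[:, y] + beta) - np.log(n_ey.sum(1) + Y * beta)
    for d in tweet["dates"]:
        logp += np.log(n_ed[:, d] + beta) - np.log(n_ed.sum(1) + D * beta)
    for l in tweet["locations"]:
        logp += np.log(n_el[:, l] + beta) - np.log(n_el.sum(1) + L * beta)
    for k in tweet["keywords"]:
        logp += np.log(n_ek[:, k] + beta) - np.log(n_ek.sum(1) + K * beta)
    p = np.exp(logp - logp.max())   # normalise and sample
    return rng.choice(E, p=p / p.sum())
```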
Once the event assignments of all tweets are known, we can easily estimate the model parameters $\pi$, $\theta$, $\phi$, $\psi$, and $\omega$. We set the hyperparameters $\alpha$ and $\beta$ to fixed values, run the Gibbs sampler for up to 10,000 iterations, and stop once the log-likelihood of the training data converges under the learned model. Finally, for every event we select the entity, the date, the location, and the top two keywords with the highest probabilities to form a 4-tuple $\langle y, d, l, k \rangle$ as the representation of that event.
To improve the precision of event extraction, we remove the least confident event element from the 4-tuples using the following rule: if the probability of an element is less than $S/T$, where $S$ is the sum of the probabilities of the other three elements and $T$ is a threshold value empirically set to 5, the element is removed from the extracted results.
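The post-processing step can be sketched as follows, assuming point estimates of the per-event distributions recovered after sampling; the function and variable names are ours.

```python
import numpy as np

# For one event, take the most probable entity, date and location and the
# top-2 keywords, then drop any element whose probability falls below S/T,
# where S is the combined probability of the other three elements and
# T = 5 as in the text above.
def extract_tuple(theta_e, phi_e, psi_e, omega_e, T=5):
    probs = {
        "entity":   (int(theta_e.argmax()), float(theta_e.max())),
        "date":     (int(phi_e.argmax()),   float(phi_e.max())),
        "location": (int(psi_e.argmax()),   float(psi_e.max())),
        "keywords": (omega_e.argsort()[-2:][::-1].tolist(),
                     float(omega_e.max())),
    }
    result = {}
    for name, (value, p) in probs.items():
        others = sum(q for n, (_, q) in probs.items() if n != name)
        # Keep the element only if it is confident enough relative to the rest.
        if p >= others / T:
            result[name] = value
    return result
```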
In this section, we first describe the Twitter corpus used in our experiments and then present how we build a baseline based on the previously proposed TwiCal system [14], the state-of-the-art open event extraction system on tweets. Finally, we present our experimental results.
We use the First Story Detection (FSD) dataset [10] in our experiments. It consists of 2,499 tweets which are manually annotated with the corresponding event instances, resulting in a total of 27 events. The tweets were published between 7th July and 12th September 2011. These events cover a range of categories, from celebrity news to accidents, and from natural disasters to science discoveries. It should be noted that some event elements, such as location, are not always available in the tweets. Automatically inferring the geolocation of tweets is a challenging task and will be considered in our future work. For tweets without time expressions, we used the tweets' publication dates as a default. The number of tweets per event ranges from 2 to around 1,000. We believe that, in reality, events mentioned in very few tweets are less likely to be significant. Therefore, the dataset was filtered by removing events mentioned in fewer than 10 tweets. This results in a final dataset containing 2,468 tweets annotated with 21 events.
The baseline we chose is TwiCal [14]. The events extracted by the baseline are represented as a 3-tuple $\langle y, d, p \rangle$, where $y$ stands for a non-location named entity, $d$ for a date, and $p$ for an event phrase. (TwiCal also groups event instances into event types such as "Sport" or "Politics" using LinkLDA, which is not considered here.) We re-implemented the system and evaluate the performance of the baseline on the correctness of the extracted three elements, excluding the location element. In the baseline approach, the tuples are extracted as follows. First, a named entity recognizer [13] is employed to identify named entities, and TempEx [9] is used to resolve temporal expressions. For each date, the baseline chooses the entity with the strongest association with that date to form a binary tuple representing an event; a sketch of this pairing step is shown below. An event phrase extractor trained on annotated tweets is required to extract event-related phrases. Due to the difficulty of re-implementing this sequence labeler without knowing the actual feature set and the annotated training data, we assume all event-related phrases are identified correctly and simply use the event trigger words annotated in the FSD corpus as $p$ to form the event 3-tuples. It is worth noting that the F-measure reported for event phrase extraction in the baseline approach is only 64% [14].
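Since the exact association statistic is left unspecified in the text above, the sketch below uses the $G^2$ log-likelihood ratio over a 2x2 contingency table as a plausible stand-in, restricted to positively associated pairs; the toy data and all names are ours.

```python
import math
from collections import Counter

def g2(a, b, c, d):
    """G^2 statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = 0.0
    for obs, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)]:
        expected = row * col / n
        if obs > 0:
            stat += obs * math.log(obs / expected)
    return 2 * stat

# Toy (entity, date) co-occurrences harvested from tweets.
pairs = ([("obama", "2011-08-18")] * 8 + [("obama", "2011-09-01")] +
         [("apple", "2011-09-01")] * 6)
joint = Counter(pairs)
ent_tot = Counter(e for e, _ in pairs)
date_tot = Counter(d for _, d in pairs)
N = len(pairs)

def best_entity(date):
    """Entity most positively associated with the given date."""
    scores = {}
    for e in ent_tot:
        a = joint[(e, date)]
        b, c = ent_tot[e] - a, date_tot[date] - a
        d = N - a - b - c
        # Keep only positive association (observed > expected co-occurrence).
        expected = ent_tot[e] * date_tot[date] / N
        scores[e] = g2(a, b, c, d) if a > expected else 0.0
    return max(scores, key=scores.get)

print(best_entity("2011-09-01"))  # 'apple'
```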
To evaluate the performance of the proposed approach, we use precision, recall, and F-measure, as in general information extraction systems [8]. For the 4-tuple $\langle y, d, l, k \rangle$, precision is calculated based on the following criteria:
1. Do the entity $y$, location $l$, and date $d$ that we have extracted refer to the same event?
2. Are the keywords $k$ in accord with the event that the other extracted elements refer to, and are they informative enough to tell us what happened?
If the extracted representation does not contain keywords, its precision is calculated by checking criterion 1 only. If it does contain keywords, its precision is calculated by checking both criteria 1 and 2.
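For concreteness, the standard formulas $P = \frac{\#\text{correct}}{\#\text{extracted}}$, $R = \frac{\#\text{correct}}{\#\text{gold}}$, and $F = \frac{2PR}{P+R}$ can be checked against Table 1. The counts below (23 of 25 extracted tuples judged correct; 16 of 21 gold events recovered) are our assumption, chosen to reproduce the reported 92% precision and 76.19% recall.

```python
def prf(correct_extracted, total_extracted, correct_gold, total_gold):
    # Precision is computed over extracted tuples, recall over gold events;
    # the two numerators differ when several tuples describe one event.
    p = correct_extracted / total_extracted
    r = correct_gold / total_gold
    return p, r, 2 * p * r / (p + r)

# Counts assumed to be consistent with Table 1 (proposed model, 4-tuple row).
p, r, f = prf(23, 25, 16, 21)
print(f"{p:.2%} {r:.2%} {f:.2%}")  # 92.00% 76.19% 83.35%
```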
The number of events, $E$, in the LEM model is set to 25. The performance of the proposed framework is presented in Table 1. The baseline re-implemented here can only output 3-tuples, and we simply use the gold-standard event trigger words as the keyword element. Still, we observe that, compared to the baseline approach, the performance of our proposed framework evaluated on the 4-tuple achieves nearly 17% improvement in precision. The overall improvement in F-measure is around 7.76%.
| Method | Tuple Evaluated | Precision | Recall | F-measure |
|---|---|---|---|---|
| Baseline | 3-tuple | 75% | 76.19% | 75.59% |
| Proposed | 3-tuple | 96% | 80.95% | 87.83% |
| Proposed | 4-tuple | 92% | 76.19% | 83.35% |
We experimented with two approaches to named entity recognition (NER) in pre-processing. One uses the NER tool trained specifically on Twitter data [13], denoted as "TW-NER" in Table 2. The other uses the traditional Stanford NER to extract named entities from news articles published in the same period and then performs fuzzy matching to identify named entities in tweets; this method is denoted as "NW-NER" in Table 2. It can be observed from Table 2 that using NW-NER improves the performance of the event extraction system significantly, by 7.5% and 3% in F-measure when evaluated on 3-tuples (without keywords) and 4-tuples (with keywords), respectively.
| Method | Tuple Evaluated | Precision | Recall | F-measure |
|---|---|---|---|---|
| TW-NER | 3-tuple | 88% | 76.19% | 80.35% |
| TW-NER | 4-tuple | 84% | 76.19% | 79.90% |
| NW-NER | 3-tuple | 96% | 80.95% | 87.83% |
| NW-NER | 4-tuple | 92% | 76.19% | 83.35% |
We need to set the number of events $E$ in the LEM model. Figure 3 shows the performance of event extraction versus different values of $E$. It can be observed that the performance of the proposed framework improves as $E$ increases, until $E$ reaches 25, which is close to the actual number of events in our data. Increasing $E$ further yields more balanced precision/recall values and a relatively stable F-measure. This shows that our LEM model is not sensitive to the number of events as long as $E$ is set to a relatively large value.
In this paper we have proposed an unsupervised Bayesian model, called the Latent Event Model (LEM), to extract structured representations of events from social media data. Instead of employing labeled corpora for training, the proposed model only requires the identification of named entities, locations, and time expressions. After that, the model can automatically extract events, each involving a named entity at a certain time and location with event-related keywords, based on the co-occurrence patterns of the event elements. Our proposed model has been evaluated on the FSD corpus. Experimental results show that our proposed framework outperforms the state-of-the-art baseline by over 7% in F-measure. In future work, we plan to investigate inferring the geolocations of tweets automatically. We also intend to study better methods for inferring dates more accurately from tweets, and to explore efficient ranking strategies for presenting the extracted events.
This work was funded by the National Natural Science Foundation of China (61103077), Ph.D. Programs Foundation of Ministry of Education of China for Young Faculties (20100092120031), Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, the Fundamental Research Funds for the Central Universities, and the UK’s EPSRC grant EP/L010690/1.