Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets.
Variation poses a serious challenge for determining who or what a name refers to. For instance, Wikipedia contains more than 100 variations of the name Barack Obama as redirects to the U.S. President article, including:
President Obama | Barack H. Obama, Jr. |
Barak Obamba | Barry Soetoro |
To relate different names, one solution is to use specifically tailored measures of name similarity such as Jaro-Winkler similarity []. This approach is brittle, however, and fails to adapt to the test data. Another option is to train a model like stochastic edit distance from known pairs of similar names [], but this requires supervised data in the test domain.
Even the best model of name similarity is not enough by itself, since two names that are similar—even identical—do not necessarily corefer. Document context is needed to determine whether they may be talking about two different people.
In this paper, we propose a method for jointly (1) learning similarity between names and (2) clustering name mentions into entities, the two major components of cross-document coreference resolution systems []. Our model is an evolutionary generative process based on the name variation model of [], which stipulates that names are often copied from previously generated names, perhaps with mutation (spelling edits). The model can thus deduce that, rather than being names for different entities, Barak Obamba and Barock obama more likely arose from the frequent name Barack Obama as a common ancestor, which accounts for most of their letters. It can also relate seemingly dissimilar names via multiple steps in the generative process:
Taylor Swift → T-Swift → T-Swizzle
Our model learns without supervision that these all refer to the same entity. Such creative spellings are especially common on Twitter and other social media; we give more examples of coreferents learned by our model in Section 8.4.
Our primary contributions are improvements on [] for the entity clustering task. Their inference procedure only clustered types (distinct names) rather than tokens (mentions in context), and relied on expensive matrix inversions for learning. Our novel approach features:
A topical model of which entities an author tends to mention from previously written text.
A name mutation model that is sensitive to features of the input and output characters and takes a reader’s comprehension into account.
A scalable Markov chain Monte Carlo sampler used in training and inference.
A minimum Bayes risk decoding procedure to pick an output clustering. The procedure is applicable to any model capable of producing a posterior over coreference decisions.
We evaluate our approach by comparing to several baselines on datasets from three different genres: Twitter, newswire, and blogs.
Cross-document coreference resolution (CDCR) was first introduced by []. Most approaches since then are based on the intuitions that coreferent names tend to have “similar” spellings and tend to appear in “similar” contexts. The distinguishing feature of our system is that both notions of similarity are learned together without supervision.
We adopt a “phylogenetic” generative model of coreference. The basic insight is that coreference is created when an author thinks of an entity that was mentioned earlier in a similar context, and mentions it again in a similar way. The author may alter the name mention string when copying it, but both names refer to the same entity. Either name may later be copied further, leading to an evolutionary tree of mentions—a phylogeny. Phylogenetic models are new to information extraction. In computational historical linguistics, [] have also modeled the mutation of strings along the edges of a phylogeny; but for them the phylogeny is observed and most mentions are not, while we observe the mentions only.
To apply our model to the CDCR task, we observe that the probability that two name mentions are coreferent is the probability that they arose from a common ancestor in the phylogeny. So we design a Monte Carlo sampler to reconstruct likely phylogenies. A phylogeny must explain every observed name. While our model is capable of generating each name independently, a phylogeny will generally achieve higher probability if it explains similar names as being similar by mutation (rather than by coincidence). Thus, our sampled phylogenies tend to make similar names coreferent—especially long or unusual names that would be expensive to generate repeatedly, and especially in contexts that are topically similar and therefore have a higher prior probability of coreference.
For learning, we iteratively adjust our model’s parameters to better explain our samples. That is, we do unsupervised training via Monte Carlo EM.
What is learned? An important component of a CDCR system is its model of name similarity [], which is often fixed up front. This role is played in our system by the name mutation model, which we take to be a variant of stochastic edit distance []. Rather than fixing its parameters before we begin CDCR, we learn them (without supervision) as part of CDCR, by training from samples of reconstructed phylogenies.
Name similarity is also an important component of within-document coreference resolution, and efforts in that area bear resemblance to our approach. [] describe an “entity-centered” model where a distance-dependent Chinese restaurant process is used to pick previous coreferent mentions within a document. Similarly, [] learn a mention similarity model based on labeled data. Our cross-document setting has no observed mention ordering and no observed entities: we must sum over all possibilities, a challenging inference problem.
The second major component of CDCR is context-based disambiguation of similar or identical names that refer to the same entity. Like [] and [], we use topics as the contexts, but learn mention topics jointly with other model parameters.
Let denote an ordered sequence of distinct named-entity mentions in documents . We assume that each document has a (single) known language, and that its mentions and their types have been identified by a named-entity recognizer. We use the object-oriented notation for attribute of mention .
Our model generates an ordered sequence although we do not observe its order. Thus each mention has latent position (e.g., ). The entire corpus, including these entities, is generated according to standard topic model assumptions; we first generate a topic distribution for a document, then sample topics and words for the document []. However, any topic may generate an entity type, e.g. person, which is then replaced by a specific name: when person is generated, the model chooses a previous mention of any person and copies it, perhaps mutating its name.11We make the closed-world assumption that the author is only aware of previous mentions from our corpus. This means that two mentions cannot be derived from a common ancestor outside our corpus. To mitigate this unrealistic assumption, we allow any ordering of the observed mentions, not respecting document timestamps or forcing the mentions from a given document to be generated as a contiguous subsequence of . Alternatively, the model may manufacture a name for a new person, though the name itself may not be new.
If all previous mentions were equally likely, this would be a Chinese Restaurant Process (CRP) in which frequently mentioned entities are more likely to be mentioned again (“the rich get richer”). We refine that idea by saying that the current topic, language, and document influence the choice of which previous mention to copy, similar to the distance-dependent CRP [].22Unlike the ddCRP, our generative story is careful to prohibit derivational cycles: each mention is copied from a previous mention in the latent ordering. This is why our phylogeny is a tree, and why our sampler is more complex. Also unlike the ddCRP, we permit asymmetric “distances”: if a certain topic or language likes to copy mentions from another, the compliment is not necessarily returned. This will help distinguish multiple John Smith entities if they tend to appear in different contexts.
Formally, each mention is derived from a parent mention where (the parent came first), (same entity) and is a copy or mutation of . In the special case where is a first mention of , is the special symbol , is a newly allocated entity of some appropriate type, and the name is generated from scratch.
Our goal is to reconstruct mappings that specify the latent properties of the mentions . The mapping forms a phylogenetic tree on the mentions, with root . Each entity corresponds to a subtree that is rooted at some child of . The mapping gives an ordering consistent with that tree in the sense that . Finally, the mapping specifies, for each mention, the topic that generated it. While and are not necessary for creating coref clusters, they are needed to produce .
Given a few constants that are referenced in the main text, we assume that the corpus was generated as follows.
First, for each topic and each language , choose a multinomial over the word vocabulary, from a symmetric Dirichlet with concentration parameter . Then set (entity count), (mention count), and for each document index :
Choose the document’s length and language . (The distributions used to choose these are unimportant because these variables are always observed.)
Choose its topic distribution from an asymmetric Dirichlet prior with parameters [].33Extension: This choice could depend on the language .
For each token position :
Choose a topic .
Choose a word conditioned on the topic and language, .
If is a named entity type (person, place, org, …) rather than an ordinary word, then increment and:
create a new mention with
Choose the parent from a distribution conditioned on the attributes just set (see §4.1).
If , increment and set a new entity . Else set .
Choose from a distribution conditioned on and (see §4.2).
Notice that the tokens in document are exchangeable: by collapsing out , we can regard them as having been generated from a CRP. Thus, for fixed values of the non-mention tokens and their topics, the probability of generating the mention sequence is proportional to the product of the probabilities of the choices in step 3 at the positions where mentions were generated. These choices generate a topic (from the CRP for document ), a type (from ), a parent mention (from the distribution over previous mentions), and a name string (conditioned on the parent’s name if any). §5 uses this fact to construct an MCMC sampler for the latent parts of .
To select a parent for a mention of type , a simple model (as mentioned above) would be a CRP: each previous mention of the same type is selected with probability proportional to 1, and is selected with probability proportional to . A larger choice of results in smaller entity clusters, because it prefers to create new entities of type rather than copying old ones.
We modify this story by re-weighting and previous mentions according to their relative suitability as the parent of :
(1)
where ranges over and all previous mentions of the same type as , that is, mentions such that and . The normalizing constant is chosen so that the probabilities sum to 1.
This is a conditional log-linear model parameterized by , where . The features are extracted from the attributes of and . Our most important feature tests whether . This binary feature has a high weight if authors mainly choose mentions from the same topic. To model which (other) topics tend to be selected, we also have a binary feature for each parent topic and each topic pair .44Many other features could be added. In a multilingual setting, one would similarly want to model whether English authors select Arabic mentions. One could also imagine features that reward proximity in the generative order (), local linguistic relationships (when and ), or social information flow (e.g., from mainstream media to Twitter). One could also make more specific versions of any feature by conjoining it with the entity type .
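As a concrete illustration, the sketch below computes this distribution for one mention. It assumes mention objects with a type attribute, a features(parent, child) callback returning named feature values (with None standing for the root), and a weight dictionary theta; these names are ours, not the paper's notation.

```python
import math

def parent_distribution(mention, previous, theta, features):
    """Re-weighted parent choice (cf. eq. (1)): a log-linear (softmax)
    distribution over the root (None) and all previous mentions of the
    same entity type. `features(parent, child)` and `theta` are assumed
    interfaces; any feature names they use are illustrative."""
    candidates = [None] + [p for p in previous if p.type == mention.type]
    scores = []
    for p in candidates:
        f = features(p, mention)
        scores.append(sum(theta.get(name, 0.0) * value for name, value in f.items()))
    m = max(scores)                                   # stabilize the softmax
    weights = [math.exp(s - m) for s in scores]
    Z = sum(weights)
    return [(p, w / Z) for p, w in zip(candidates, weights)]
```

Under this formulation, a feature that fires only for the root candidate plays the role of the tunable root weight used in §8.1 to trade off precision against recall.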
Let denote a mention with parent . As in [], its name is a stochastic transduction of its parent’s name . That is,
(2)
is given by the probability that applying a random sequence of edits to the characters of would yield . The contextual probabilities of different edits depend on learned parameters .
(2) is the total probability of all edit sequences that derive from . It can be computed in time by dynamic programming.
The probability of a single edit sequence, which corresponds to a monotonic alignment of to , is a product of individual edit probabilities of the form , which is conditioned on the next input character . The edit replaces input with output (where is the empty string and is the alphabet of language ). Insertions and deletions are the cases where respectively or —we do not allow both at once. All other edits are substitutions. When is the special end-of-string symbol , the only allowed edits are the insertion and the substitution . We define the edit probability using a locally normalized log-linear model:
(3)
We use a small set of simple feature functions , which consider conjunctions of the attributes of the characters and : character, character class (letter, digit, etc.), and case (upper vs. lower).
More generally, the probability (2) may also be conditioned on other variables such as on the languages and —this leaves room for a transliteration model when —and on the entity type . The features in (3) may then depend on these variables as well.
Notice that we use a locally normalized probability for each edit. This enables faster and simpler training than the similar model of [], which uses a globally normalized probability for the whole edit sequence.
When , we are generating a new name . We use the same model, taking to be the empty string (but with rather than as the end-of-string symbol). This yields a feature-based unigram language model (whose character probabilities may differ from usual insertion probabilities because they see as the lookahead character).
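The total probability in (2) can be computed by a standard forward dynamic program over monotonic alignments. The sketch below assumes a callback edit_prob(input, output, lookahead) giving the locally normalized probability of one edit, conditioned on the next input character; the symbols EPS and EOS and the callback itself are placeholders rather than the paper's exact notation.

```python
EPS, EOS = "", "#"   # empty string and end-of-string symbol (illustrative)

def name_prob(parent_name, child_name, edit_prob):
    """Sum over all monotonic edit sequences that turn parent_name into
    child_name (cf. eq. (2)), in O(|parent| * |child|) time.
    alpha[i][j] = total probability of having consumed i input characters
    and emitted j output characters."""
    x, y = parent_name, child_name
    n, m = len(x), len(y)
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        look = x[i] if i < n else EOS        # conditioning (lookahead) character
        for j in range(m + 1):
            a = alpha[i][j]
            if a == 0.0:
                continue
            if i < n and j < m:              # substitution (possibly a copy)
                alpha[i + 1][j + 1] += a * edit_prob(x[i], y[j], look)
            if i < n:                        # deletion
                alpha[i + 1][j] += a * edit_prob(x[i], EPS, look)
            if j < m:                        # insertion
                alpha[i][j + 1] += a * edit_prob(EPS, y[j], look)
    return alpha[n][m] * edit_prob(EOS, EOS, EOS)   # terminating edit
```

Generating a new name for a root attachment reuses the same routine with an empty parent_name, reducing it to the feature-based unigram language model described above.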
Pragmatics. We can optionally make the model more sophisticated. Authors tend to avoid names that readers would misinterpret (given the previously generated names). The edit model thinks that is relatively high (because CIA is a short string) and so is . But in fact, if CIA has already been frequently used to refer to the Central Intelligence Agency, then an author is unlikely to use it for a different entity.
To model this pragmatic effect, we multiply our definition of by an extra factor , where is the effect strength.55Currently we omit the step of renormalizing this deficient model. Our training procedure also ignores the extra factor. Here is the probability that a reader correctly identifies the entity . We take this to be the probability that a reader who knows our sub-models would guess some parent having the correct entity (or if is a first mention): . Here ranges over mentions (including ) that precede in the ordering , and —defined later in sec. 5.3—is proportional to the posterior probability that , given the current state of the sampler.66Better, one could integrate over the reader’s guess of .
We use a block Gibbs sampler, which from an initial state repeats these steps:
Sample the ordering from its conditional distribution given all other variables.
Sample the topic vector likewise.
Sample the phylogeny likewise.
Output the current sample .
It is difficult to draw exact samples at steps 1 and 2. Thus, we sample or from a simpler proposal distribution, but correct the discrepancy using the Independent Metropolis-Hastings (IMH) strategy: with an appropriate probability, reject the proposed new value and instead use another copy of the current value [].
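Concretely, one IMH correction step might look like the sketch below, where propose draws from the simpler proposal distribution and log_target / log_proposal are assumed callbacks returning (unnormalized) log-densities:

```python
import math
import random

def imh_step(current, propose, log_target, log_proposal):
    """Independent Metropolis-Hastings: accept the proposal with probability
    min(1, [p(new) q(current)] / [p(current) q(new)]); otherwise keep another
    copy of the current value."""
    proposed = propose()
    log_ratio = (log_target(proposed) + log_proposal(current)
                 - log_target(current) - log_proposal(proposed))
    if math.log(random.random()) < min(0.0, log_ratio):
        return proposed
    return current
```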
We resample the ordering of the mentions , conditioned on the other variables. The current phylogeny already defines a partial order on , since each parent must precede its children. For instance, phylogeny (a) below requires and . This partial order is compatible with 2 total orderings, and . By contrast, phylogeny (b) requires the total ordering .
(Figure: example phylogenies (a) and (b).)
We first sample an ordering (the ordering of mentions with parent , i.e. all mentions) uniformly at random from the set of orderings compatible with the current . (We provide details about this procedure in Appendix A.)77The full version of this paper is available at http://cs.jhu.edu/~noa/publications/phylo-acl-14.pdf However, such orderings are not in fact equiprobable given the other variables—some orderings better explain why that phylogeny was chosen in the first place, according to our competitive parent selection model (§4.1). To correct for this bias using IMH, we accept the proposed ordering with probability
(4)
where is the current ordering. Otherwise we reject and reuse for the new sample.
Each context word and each named entity is associated with a latent topic. The topics of context words are assumed exchangeable, and so we resample them using Gibbs sampling [].
Unfortunately, this is prohibitively expensive for the (non-exchangeable) topics of the named mentions . A Gibbs sampler would have to choose a new value for with probability proportional to the resulting joint probability of the full sample. This probability is expensive to evaluate because changing will change the probability of many edges in the current phylogeny . (Equation (1) puts it in competition with other parents, so every mention that follows must recompute how happy it is with its current parent .)
Rather than resampling one topic at a time, we resample as a block. We use a proposal distribution for which block sampling is efficient, and use IMH to correct the error in this proposal distribution.
Our proposal distribution is an undirected graphical model whose random variables are the topics and whose graph structure is given by the current phylogeny :
(5)
is an approximation to the posterior distribution over . As detailed below, a proposal can be sampled from in time where is the number of topics, because the only interactions among topics are along the edges of the tree . The unary factor gives a weight for each possible value of , and the binary factor gives a weight for each possible value of the pair .
The factors in (5) approximate the topic model’s prior distribution over . is proportional to the probability that a Gibbs sampling step for an ordinary topic model would choose this value of . This depends on whether—in the current sample— is currently common in ’s document and is commonly generated by . It ignores the fact that we will also be resampling the topics of the other mentions.
The factors in (5) approximate (up to a constant factor), where is the current phylogeny. Specifically, approximates the probability of a single edge. It ought to be given by (1), but we use only the numerator of (1), which avoids modeling the competition among parents.
We sample from using standard methods, similar to sampling from a linear-chain CRF by running the backward algorithm followed by forward sampling. Specifically, we run the sum-product algorithm from the leaves up to the root , at each node computing the following for each topic :
Then we sample from the root down to the leaves, first sampling from , then at each sampling the topic to be with probability proportional to .
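A sketch of this proposal sampler is given below. It assumes the current phylogeny is available as a children map (with None for the root), together with callbacks unary(v, k) and pairwise(k_parent, k_child) for the nonnegative factors in (5); all of these names are illustrative. The upward pass and the downward sampling pass each cost O(K^2) per edge.

```python
import random

def sample_topics(children, K, unary, pairwise):
    """Upward sum-product followed by downward (ancestral) sampling on the
    tree-structured proposal (5), analogous to backward + forward sampling
    in a linear-chain CRF. Returns one joint topic assignment."""
    ROOT = None
    beta = {}                                     # upward messages beta[v][k]

    def up(v):
        for c in children.get(v, []):
            up(c)
        beta[v] = []
        for k in range(K):
            b = unary(v, k)
            for c in children.get(v, []):
                b *= sum(pairwise(k, kc) * beta[c][kc] for kc in range(K))
            beta[v].append(b)

    def draw(weights):
        r, acc, Z = random.random(), 0.0, sum(weights)
        for k, w in enumerate(weights):
            acc += w / Z
            if r < acc:
                return k
        return K - 1

    topics = {}
    for sub in children.get(ROOT, []):            # one subtree per root child
        up(sub)
        stack = [(sub, None)]
        while stack:
            v, k_parent = stack.pop()
            if k_parent is None:                  # no pairwise factor to the root
                w = beta[v]
            else:
                w = [pairwise(k_parent, k) * beta[v][k] for k in range(K)]
            topics[v] = draw(w)
            stack.extend((c, topics[v]) for c in children.get(v, []))
    return topics
```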
Again we use IMH to correct for the bias in : we accept the resulting proposal with probability
(6)
While might seem slow to compute because it contains many factors (1) with different denominators , one can share work by visiting the mentions in their order . Most summands in were already included in , where is the latest previous mention having the same attributes as (e.g., same topic).
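The sketch below illustrates this bookkeeping under the simplifying assumption that the parent-selection features depend on the child only through a small attribute signature (here, topic and entity type), so that mentions sharing a signature share a normalizer up to the candidates added in between; score(parent, child) stands in for the unnormalized log-linear score and is an assumed interface.

```python
from collections import defaultdict
from math import exp

def normalizers(mentions, order, score):
    """Compute the denominator of eq. (1) for every mention, reusing the sum
    computed for the latest previous mention with the same attributes."""
    Z = {}
    cache = defaultdict(lambda: (0, 0.0))   # signature -> (#candidates summed, partial sum)
    seen = []                               # previous mentions, in generation order
    for i in order:
        x = mentions[i]
        sig = (x.topic, x.type)             # assumed: features depend only on these
        done, partial = cache[sig]
        if done == 0:
            partial = exp(score(None, x))   # the root is always a candidate
        for p in seen[done:]:               # only candidates added since last time
            if p.type == x.type:
                partial += exp(score(p, x))
        cache[sig] = (len(seen), partial)
        Z[i] = partial
        seen.append(x)
    return Z
```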
It is easy to resample the phylogeny. For each , we must choose a parent from among the possible parents (having and ). Since the ordering prevents cycles, the resulting phylogeny is indeed a tree.
Given the topics , the ordering , and the observed names, we choose an value according to its posterior probability. This is proportional to , independent of any other mention’s choice of parent. The two factors here are given by (1) and (2) respectively. As in the previous section, the denominators in the factors can be computed efficiently with shared work.
With the pragmatic model (section 4.2), the parent choices are no longer independent; then the samples of should be corrected by IMH as usual.
The initial sampler state is obtained as follows. (1) We fix topics via collapsed Gibbs sampling []. The sampler is run for 1000 iterations, and the final sampler state is taken to be . This process treats all topics as exchangeable, including those associated with named entities. (2) Given the topic assignment , initialize to the phylogeny rooted at that maximizes . This is a maximum rooted directed spanning tree problem that can be solved in time []. The weight is defined as in section 5.3—except that since we do not yet have an ordering , we do not restrict the possible values of to mentions with . (3) Given , sample an ordering using the procedure described in §5.1.
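Step (2) can be implemented with an off-the-shelf Chu-Liu/Edmonds routine, as in the sketch below; edge_logprob(parent, child) is a placeholder for the §5.3 weight (with None meaning the root), mentions are assumed to carry a type attribute, and the use of networkx is our choice rather than the paper's.

```python
import networkx as nx

def init_phylogeny(mentions, edge_logprob):
    """Initialize the phylogeny as a maximum-weight arborescence rooted at a
    special ROOT node; since only ROOT has no incoming edges, the spanning
    arborescence is necessarily rooted there. No ordering restricts the
    candidate parents at this stage."""
    ROOT = "ROOT"
    G = nx.DiGraph()
    for child in mentions:
        G.add_edge(ROOT, child, weight=edge_logprob(None, child))
        for parent in mentions:
            if parent is not child and parent.type == child.type:
                G.add_edge(parent, child, weight=edge_logprob(parent, child))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return {child: parent for parent, child in tree.edges()}
```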
Evaluating the likelihood and its partial derivatives with respect to the parameters of the model requires marginalizing over our latent variables. As this marginalization is intractable, we resort to a Monte Carlo EM procedure [], which iterates the following two steps:
E-step: Collect samples by MCMC simulation as in §5, given current model parameters and .
M-step: Improve and to increase88We actually do MAP-EM, which augments (7) by adding the log-likelihoods of and under a Gaussian prior.
(7)
It is not necessary to locally maximize at each M-step, merely to improve it if it is not already at a local maximum []. We improve it by a single update: at the th M-step, we update our parameters to
(8)
where is a fixed scaling term and is an adaptive learning rate given by AdaGrad [].
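For illustration, the single update taken at each M-step might be implemented as below, with parameters and gradients stored as dictionaries keyed by feature name; the class name, eta0, and the epsilon constant are our choices rather than the paper's.

```python
import math
from collections import defaultdict

class AdaGradStep:
    """One gradient-ascent step per M-step (cf. eq. (8)): a fixed scale eta0
    times a per-parameter learning rate that shrinks with the accumulated
    squared gradients, as in AdaGrad."""
    def __init__(self, eta0=0.1, eps=1e-8):
        self.eta0, self.eps = eta0, eps
        self.sumsq = defaultdict(float)

    def update(self, params, grad):
        for name, g in grad.items():
            self.sumsq[name] += g * g
            rate = self.eta0 / (self.eps + math.sqrt(self.sumsq[name]))
            params[name] = params.get(name, 0.0) + rate * g   # ascent on (7)
        return params
```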
We now describe how to compute the gradient . The gradient with respect to the parent selection parameters is
(9)
The outer summation ranges over all edges in the samples. The other variables in (9) are associated with the edge being summed over. That edge explains a mention as a mutation of some parent in the context of a particular sample . The possible parents range over and the mentions that precede according to the ordering , while the features and distribution depend on the topics .
As for the mutation parameters, let be the fraction of samples in which is the parent of . This is the expected number of times that the string mutated into . Given this weighted set of string pairs, let be the expected number of times that edit was chosen in context : this can be computed using dynamic programming to marginalize over the latent edit sequence that maps to , for each . The gradient of with respect to is
(10)
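For a locally normalized model, this gradient has the usual observed-minus-expected form, where the "observed" counts are themselves expected edit counts from the dynamic program. The sketch below assumes interfaces expected_counts, edit_space, edit_prob, and features that are illustrative stand-ins for the actual implementation.

```python
from collections import defaultdict

def mutation_gradient(expected_counts, edit_space, edit_prob, features):
    """Gradient of the expected log-likelihood of the sampled string pairs
    with respect to the edit-model parameters (cf. eq. (10)).
    expected_counts maps (context, edit) -> fractional count."""
    grad = defaultdict(float)
    mass_per_context = defaultdict(float)
    for (context, edit), c in expected_counts.items():
        for name, value in features(edit, context).items():
            grad[name] += c * value                  # observed (expected) counts
        mass_per_context[context] += c
    for context, total in mass_per_context.items():  # subtract model expectations
        for edit in edit_space(context):
            p = edit_prob(edit, context)
            for name, value in features(edit, context).items():
                grad[name] -= total * p * value
    return dict(grad)
```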
From a single phylogeny , we deterministically obtain a clustering by removing the root . Each of the resulting connected components corresponds to a cluster of mentions. Our model gives a distribution over phylogenies (given observations and learned parameters )—and thus gives a posterior distribution over clusterings , which can be used to answer various queries.
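This step amounts to taking connected components after deleting the root; a minimal sketch, assuming the phylogeny is stored as a child-to-parent map with None standing for the root:

```python
from collections import defaultdict

def clusters_from_phylogeny(parent):
    """Each child of the root starts one entity and collects all of its
    descendants; returns a list of mention clusters."""
    memo = {}

    def entity_of(x):
        if parent[x] is None:            # a first mention names its entity
            return x
        if x not in memo:
            memo[x] = entity_of(parent[x])
        return memo[x]

    clusters = defaultdict(set)
    for x in parent:
        clusters[entity_of(x)].add(x)
    return list(clusters.values())
```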
A traditional query is to request a single clustering . We prefer the clustering that minimizes Bayes risk (MBR) []:
(11)
This minimizes our expected loss, where denotes the loss associated with picking when the true clustering is . In practice, we again estimate the expectation by sampling values.
The Rand index []—unlike our actual evaluation measure—is an efficient choice of loss function for use with (11). It is the fraction of mention pairs on which two clusterings agree:
R = (TP + TN) / (TP + TN + FP + FN),
where the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) use the clustering to evaluate how well classifies the mention pairs as coreferent or not. More similar clusterings achieve a larger R, with R = 1 iff the two clusterings are identical. In all cases, 0 ≤ R ≤ 1.
The MBR decision rule for the (negated) Rand index is easily seen to be equivalent to
(12)
where denotes coreference according to . As explained above, the are coreference probabilities that can be estimated from a sample of clusterings .
This objective corresponds to min-max graph cut [], an NP-hard problem with an approximate solution [].99In our experiments, we run the clustering algorithm five times, initialized from samples chosen at random from the last 10% of the sampler run, and keep the clustering that achieved highest expected Rand score.
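The quantities this decoding step needs can be estimated directly from the posterior samples, as in the sketch below; clusterings are represented as lists of sets of mention ids, and the function names are ours.

```python
from itertools import combinations

def pairwise_coref_probs(samples):
    """Estimate, for each mention pair, the posterior probability that the
    two mentions corefer, from a list of sampled clusterings."""
    counts = {}
    for clustering in samples:
        label = {m: k for k, cluster in enumerate(clustering) for m in cluster}
        for i, j in combinations(sorted(label), 2):
            counts[(i, j)] = counts.get((i, j), 0) + (label[i] == label[j])
    return {pair: c / len(samples) for pair, c in counts.items()}

def expected_rand(clustering, probs):
    """Expected (unnormalized) Rand index of a candidate clustering: pairs it
    joins contribute p_ij, pairs it separates contribute 1 - p_ij. Used to
    choose among the approximate min-max cut solutions."""
    label = {m: k for k, cluster in enumerate(clustering) for m in cluster}
    return sum(p if label[i] == label[j] else 1.0 - p
               for (i, j), p in probs.items())
```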
In this section, we describe experiments on three different datasets. Our main results are described first: Twitter features many instances of name variation that we would like our model to be able to learn. We also report the performance of different ablations of our full approach, in order to see which consistently helped across the different splits. We report additional experiments on the ACE 2008 corpus, and on a political blog corpus, to demonstrate that our approach is applicable in different settings.
For Twitter and ACE 2008, we report the standard metric []. For the political blog dataset, the reference does not consist of entity annotations, and so we follow the evaluation procedure of [].
Data. We use a novel corpus of Twitter posts discussing the 2013 Grammy Award ceremony. This is a challenging corpus, featuring many instances of name variation. The dataset consists of five splits (by entity), the smallest of which is 604 mentions and the largest is 1374. We reserve the largest split for development purposes, and report our results on the remaining four. Appendix B provides more detail about the dataset.
Baselines. We use the discriminative entity clustering algorithm of [] as our baseline; their approach was found to outperform another generative model which produced a flat clustering of mentions via a Dirichlet process mixture model. Their method uses Jaro-Winkler string similarity to match names, then clusters mentions with matching names (for disambiguation) by comparing their unigram context distributions using the Jensen-Shannon metric. We also compare to the exact-match baseline, which assigns all mentions with identical name strings to the same entity.
Procedure. We run four test experiments in which one split is used to pick model hyperparameters and the remaining three are used for test. For the discriminative baseline, we tune the string match threshold, context threshold, and the weight of the context model prior (all via grid search). For our model, we tune only the fixed weight of the root feature, which determines the precision/recall trade-off (larger values of this feature result in more attachments to and hence more entities). We leave other hyperparameters fixed: 16 latent topics, and Gaussian priors on all log-linear parameters. For phylo, the entity clustering is the result of (1) training the model using EM, (2) sampling from the posterior to obtain a distribution over clusterings, and (3) finding a consensus clustering. We use 20 iterations of EM with 100 samples per E-step for training, and use 1000 samples after training to estimate the posterior. We report results using three variations of our model: phylo does not consider mention context (all mentions effectively have the same topic) and determines mention entities from a single sample of (the last); phylo+topic adds context (§5.2); phylo+topic+mbr uses the full posterior and consensus clustering to pick the output clustering (§7). Our results are shown in Table 1.1010Our single-threaded implementation took around 15 minutes per fold of the Twitter corpus on a personal laptop with a 2.3 GHz Intel Core i7 processor (including time required to parse the data files). Typical acceptance rates for ordering and topic proposals ranged from 0.03 to 0.08.
|                 | Mean Test |      |      |
|-----------------|-----------|------|------|
|                 | P         | R    | F1   |
| exact-match     | 99.6      | 53.7 | 69.8 |
| baseline []     | 92.1      | 69.8 | 79.3 |
| phylo           | 85.3      | 91.4 | 88.7 |
| phylo+topic     | 92.8      | 90.8 | 91.8 |
| phylo+topic+mbr | 92.9      | 90.9 | 91.9 |
Data. We use the ACE 2008 dataset, which is described in detail in []. It is split into a development portion and a test portion. The baseline system took the first mention from each (gold) within-document coreference chain as the canonical mention, ignoring other mentions in the chain; we follow the same procedure in our experiments.1111That is, each within-document coreference chain is mapped to a single mention as a preprocessing step.
Baselines & Procedure. We use the same baselines as in §8.1. On development data, modeling pragmatics as in §4.2 gave large improvements for organizations (8 points in F-measure), correcting the tendency to assume that short names like CIA were coincidental homonyms. Hence we allowed and tuned it on development data.1212We used only a simplified version of the pragmatic model, approximating as 1 or 0 according to whether . We also omitted the IMH step from section 5.3. The other results we report do not use pragmatics at all, since we found that it gave only a slight improvement on Twitter. Results are in Table 2.
|     |                 | Test |      |      |
|-----|-----------------|------|------|------|
|     |                 | P    | R    | F1   |
| per | exact-match     | 98.0 | 81.2 | 88.8 |
|     | baseline []     | 95.0 | 88.9 | 91.9 |
|     | phylo+topic+mbr | 97.2 | 88.6 | 92.7 |
| org | exact-match     | 98.2 | 78.3 | 87.1 |
|     | baseline []     | 92.1 | 88.5 | 90.3 |
|     | phylo+topic+mbr | 95.5 | 80.9 | 87.6 |
Data. The CMU political blogs dataset consists of 3000 documents about U.S. politics []. Preprocessed as described in [], the data consists of 10647 entity mentions. Unlike our other datasets, mentions are not annotated with entities: the reference consists of a table of 126 entities, where each row is the canonical name of one entity.
Baselines. We compare to the system results reported in Figure 2 of []. This includes a baseline hierarchical clustering approach, the “EEA” name canonicalization system of [], as well as the model proposed by []. Like the output of our model, the output of their hierarchical clustering baseline is a mention clustering, and therefore must be mapped to a table of canonical entity names to compare to the reference table.
Procedure & Results. We tune our method as in previous experiments, on the initialization data used by [], which consists of a subset of 700 documents of the full dataset. The tuned model then produced a mention clustering on the full political blog corpus. As the mapping from clusters to a table is not fully detailed in [], we used a simple heuristic: the most frequent name in each cluster is taken as the canonical name, augmented by any titles from a predefined list appearing in any other name in the cluster. The resulting table is then evaluated against the reference, as described in []. We achieved a response score of 0.17 and a reference score of 0.61. Though not state-of-the-art, this result is close to the score of the “EEA” system of [], as reported in Figure 2 of [], which is specifically designed for the task of canonicalization.
On the Twitter dataset, we obtained a 12.6-point F1 improvement over the baseline. To understand our model’s behavior, we looked at the sampled phylogenetic trees on development data. One reason our model does well in this noisy domain is that it is able to relate seemingly dissimilar names via successive steps. For instance, our model learned to relate many variations of LL Cool J:
Cool James | LLCoJ | El-El Cool John |
LL | LL COOL JAMES | LLCOOLJ |
In the sample we inspected, these mentions were also assigned the same topic, further boosting the probability of the configuration.
The ACE dataset, consisting of editorialized newswire, naturally contains less name variation than Twitter data. Nonetheless, we find that the variation that does appear is often properly handled by our model. For instance, we see several instances of variation due to transliteration that were all correctly grouped together, such as Megawati Soekarnoputri and Megawati Sukarnoputri. The pragmatic model was also effective in grouping common acronyms into the same entity.
We found that multiple samples tend to give different phylogenies (so the sampler is mobile), but essentially the same clustering into entities (which is why consensus clustering did not improve much over simply using the last sample). Random restarts of EM might create more variety by choosing different locally optimal parameter settings. It may also be beneficial to explore other sampling techniques [].
Our method assembles observed names into an evolutionary tree. However, the true tree must include many names that fall outside our small observed corpora, so our model would be a more appropriate fit for a far larger corpus. Larger corpora also offer stronger signals that might enable our Monte Carlo methods to mix faster and detect regularities more accurately.
A common error of our system is to connect mentions that share long substrings, such as different persons who share a last name, or different organizations that contain University of. A more powerful name mutation model than the one we use here would recognize entire words, for example inserting a common title or replacing a first name with its common nickname. Modeling the internal structure of names [] in the mutation model is a promising future direction.
Our primary contribution consists of new modeling ideas, and associated inference techniques, for the problem of cross-document coreference resolution. We have described how writers systematically plunder and then systematically modify the work of past writers. Inference under such models could also play a role in tracking evolving memes and social influence, not merely in establishing strict coreference. Our model also provides an alternative to the distance-dependent CRP.
Our implementation is available for research use at: https://bitbucket.org/noandrews/phyloinf.
In general, the subtree rooted at vertex defines a partial ordering on its own mentions. To sample a total ordering uniformly at random from among those compatible with that partial ordering, first recursively sample orderings compatible with the subtrees rooted at ’s children. Then uniformly sample an interleaving of the orderings, and prepend itself to this interleaving to obtain . To sample an interleaving, select one of the input orderings at random, with probability proportional to its size, and print and delete its first element. Repeating this step until all of the input orderings are empty will print a random interleaving. Note that in the base case where is a leaf (so ), this procedure terminates immediately, having printed the empty ordering. Our is the output of running this recursive process with .
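A compact sketch of this recursive procedure, with the phylogeny given as a map from each vertex to its list of children and the root itself not emitted (the representation is our choice):

```python
import random

def sample_compatible_ordering(children, root):
    """Sample uniformly among the total orderings compatible with the tree:
    recursively build each child subtree's ordering (child first), then merge
    the orderings by repeatedly choosing one with probability proportional to
    its remaining size and emitting its first element."""
    def subtree(v):
        parts = [[c] + subtree(c) for c in children.get(v, [])]
        merged = []
        while any(parts):
            r = random.randrange(sum(len(p) for p in parts))
            for p in parts:
                if r < len(p):
                    merged.append(p.pop(0))
                    break
                r -= len(p)
        return merged
    return subtree(root)
```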
Using the Twitter 1% streaming API, we collected all tweets during the 2013 Grammy music awards ceremony, which occurred on Feb 10, 2013 between 8:00pm Eastern (1:00am GMT) and 11:30pm (4:30am GMT). We used Carmen geolocation [1] to identify tweets that originated in the United States or Canada, and removed tweets whose authors had not selected English as their UI language. This yielded a total of 564,892 tweets. We then selected tweets that contained the string “grammy” (case insensitive), reducing the set to 50,429 tweets. These tweets were processed for POS and NER using the University of Washington Twitter NLP tools1313https://github.com/aritter/twitter_nlp [3]. Tweets that did not include a person mention were removed. For simplicity, we selected a single person reference per tweet. The final set contained 15,736 tweets. Of these, 5000 have been annotated for entities.
A first human annotator made an initial pass over 1,000 tweets and then considered the remaining 4,000 tweets. This provided an opportunity to refine the annotation guidelines after reviewing some of the data. The annotator was asked to assign a unique integer to each entity and to annotate each tweet containing a mention of that person with the corresponding integer. Additionally, the annotator was asked to fix incorrect mention strings. If the extracted mention was incorrect or referred to a non-person, it was removed. If it was mostly correct but omitted a token or included an extra one, the annotator corrected it. Similar to Guo et al. (2013), ambiguous mentions were removed. However, unlike their annotation effort, all persons, including those not in Wikipedia, were included. Mentions consisting of usernames were excluded (e.g. @taylorswift13). Following this protocol, the annotator removed 423 tweets. A second annotator inspected the annotations to correct mistakes and fix ambiguous references. The final annotated corpus contains 4,577 annotated tweets and 273 distinct entities. This corpus was then split into five folds by first sorting the entities by number of mentions, then performing systematic sampling of the entities on the sorted list.