BLANC is a link-based coreference evaluation metric for measuring the quality of coreference systems on gold mentions. This paper extends the original BLANC (“BLANC-gold” henceforth) to system mentions, removing the gold mention assumption. The proposed BLANC falls back seamlessly to the original one if system mentions are identical to gold mentions, and it is shown to strongly correlate with existing metrics on the 2011 and 2012 CoNLL data.
Coreference resolution aims at identifying natural language expressions (or mentions) that refer to the same entity. It entails partitioning (often imperfect) mentions into equivalence classes. A critically important problem is how to measure the quality of a coreference resolution system. Many evaluation metrics have been proposed in the past two decades, including the MUC measure [9], B-cubed [1], CEAF [3] and, more recently, BLANC-gold [7]. B-cubed and CEAF treat entities as sets of mentions and measure the agreement between key (or gold standard) entities and response (or system-generated) entities, while MUC and BLANC-gold are link-based.
In particular, MUC measures the degree of agreement between key coreference links (i.e., links among mentions within entities) and response coreference links, while non-coreference links (i.e., links formed by mentions from different entities) are not explicitly taken into account. This leads to a phenomenon where coreference systems outputting large entities are scored more favorably than those outputting small entities [3]. BLANC [7], on the other hand, considers both coreference links and non-coreference links. It calculates recall, precision and F-measure separately on coreference and non-coreference links in the usual way, and defines the overall recall, precision and F-measure as the mean of the respective measures for coreference and non-coreference links.
The BLANC-gold metric was developed with the assumption that response mentions and key mentions are identical. In reality, however, mentions need to be detected from natural language text and the result is, more often than not, imperfect: some key mentions may be missing in the response, and some response mentions may be spurious—so-called “twinless” mentions by Stoyanov et al. (2009). Therefore, the identical-mention-set assumption limits BLANC-gold’s applicability when gold mentions are not available, or when one wants to have a single score measuring both the quality of mention detection and coreference resolution. The goal of this paper is to extend the BLANC-gold metric to imperfect response mentions.
We first briefly review the original definition of BLANC, and rewrite its definition using set notation. We then argue that the gold-mention assumption in Recasens and Hovy (2011) can be lifted without changing the original definition. In fact, the proposed BLANC metric subsumes the original one in that its value is identical to the original one when response mentions are identical to key mentions.
The rest of the paper is organized as follows. We introduce the notation used in this paper in Section 2. We then present the original BLANC-gold in Section 3 using the set notation defined in Section 2. This paves the way to generalizing it to imperfect system mentions, which is presented in Section 4. The proposed BLANC is applied to the CoNLL 2011 and 2012 shared task participants, and the scores and their correlations with existing metrics are shown in Section 5.
To facilitate the presentation, we define the notations used in the paper.
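We use key to refer to gold standard mentions or entities, and response to refer to system mentions or entities. The collection of key entities is denoted by $K = \{k_i\}_{i=1}^{|K|}$, where $k_i$ is the $i$-th key entity; accordingly, $R = \{r_j\}_{j=1}^{|R|}$ is the set of response entities, and $r_j$ is the $j$-th response entity. We assume that mentions in $\{k_i\}$ and $\{r_j\}$ are unique; in other words, there is no duplicate mention.
Let $C_k(i)$ and $C_r(j)$ be the sets of coreference links formed by mentions in $k_i$ and $r_j$:
$$C_k(i) = \{(m_1, m_2) : m_1 \in k_i,\ m_2 \in k_i,\ m_1 \neq m_2\}$$
$$C_r(j) = \{(m_1, m_2) : m_1 \in r_j,\ m_2 \in r_j,\ m_1 \neq m_2\}$$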
As can be seen, a link is an undirected edge between two mentions, and it can be equivalently represented by a pair of mentions. Note that when an entity consists of a single mention, its coreference link set is empty.
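Let $N_k(i,j)$ ($i \neq j$) be the key non-coreference links formed between mentions in $k_i$ and those in $k_j$, and let $N_r(i,j)$ ($i \neq j$) be the response non-coreference links formed between mentions in $r_i$ and those in $r_j$:
$$N_k(i,j) = \{(m_1, m_2) : m_1 \in k_i,\ m_2 \in k_j\}$$
$$N_r(i,j) = \{(m_1, m_2) : m_1 \in r_i,\ m_2 \in r_j\}$$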
Note that the non-coreference link set is empty when all mentions are in the same entity.
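We use the same letter and subscript without the index in parentheses to denote the union of sets, e.g.,
$$C_k = \bigcup_i C_k(i), \quad N_k = \bigcup_{i \neq j} N_k(i,j), \quad C_r = \bigcup_j C_r(j), \quad N_r = \bigcup_{i \neq j} N_r(i,j).$$
We use $T_k = C_k \cup N_k$ and $T_r = C_r \cup N_r$ to denote the total set of key links and the total set of response links, respectively. Clearly, $C_k$ and $N_k$ form a partition of $T_k$ since $C_k \cap N_k = \emptyset$ and $T_k = C_k \cup N_k$. Likewise, $C_r$ and $N_r$ form a partition of $T_r$.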
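The link sets defined above can be built mechanically from a list of entities. The following is a minimal sketch, assuming entities are given as Python sets of mention identifiers and links are stored as unordered pairs; the helper names `coref_links`, `noncoref_links` and `all_links` are illustrative and are not taken from the CoNLL scorer.

```python
from itertools import combinations

def coref_links(entities):
    """Unordered mention pairs inside the same entity (C_k or C_r)."""
    return {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}

def all_links(entities):
    """Every unordered pair of mentions on one side (T_k or T_r)."""
    mentions = sorted(set().union(*entities))
    return {frozenset(p) for p in combinations(mentions, 2)}

def noncoref_links(entities):
    """Pairs whose two mentions lie in different entities (N_k or N_r)."""
    return all_links(entities) - coref_links(entities)

# Hypothetical toy input: two key entities over five mentions.
key = [{"a", "b", "c"}, {"d", "e"}]
C_k, N_k, T_k = coref_links(key), noncoref_links(key), all_links(key)

# C_k and N_k are disjoint and together cover T_k, i.e. they partition it.
assert not (C_k & N_k) and (C_k | N_k) == T_k
print(len(C_k), len(N_k), len(T_k))  # 4 6 10
```

Storing each link as a `frozenset` makes the pair unordered, which matches the view of a link as an undirected edge.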
We say that a key link $l_1 \in T_k$ equals a response link $l_2 \in T_r$ if and only if the pair of mentions from which the links are formed are identical. We write $l_1 = l_2$ if two links are equal. It is easy to see that the gold mention assumption (the set of response mentions is the same as the set of key mentions) can be equivalently stated as $T_k = T_r$ (this does not necessarily mean that $C_k = C_r$ or $N_k = N_r$).
We also use $|\cdot|$ to denote the size of a set.
BLANC-gold is adapted from Rand Index [6], a metric for clustering objects. Rand Index is defined as the ratio between the number of correct within-cluster links plus the number of correct cross-cluster links, and the total number of links.
When $T_k = T_r$, Rand Index can be applied directly since coreference resolution reduces to a clustering problem where mentions are partitioned into clusters (entities):
$$\text{Rand Index} = \frac{|C_k \cap C_r| + |N_k \cap N_r|}{|T_k|} \tag{1}$$
In practice, though, the simple-minded adoption of Rand Index is not satisfactory since the number of non-coreference links often overwhelms that of coreference links [7], that is, $|N_k| \gg |C_k|$ and $|N_r| \gg |C_r|$. Rand Index, if used without modification, would not be sensitive to changes of coreference links.
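The insensitivity can be seen on a small, made-up input. The sketch below assumes the Rand Index is the number of agreeing links divided by the total number of links, as described in the text; the helper names `links` and `rand_index` and the 20-mention document are hypothetical. The response finds no coreference link at all, yet it scores close to 1 because almost every link is a non-coreference link.

```python
from itertools import combinations

def links(entities):
    """Return (coreference links, non-coreference links, all links)."""
    coref = {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}
    mentions = sorted(set().union(*entities))
    total = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, total - coref, total

def rand_index(key, response):
    C_k, N_k, T_k = links(key)
    C_r, N_r, T_r = links(response)
    if T_k != T_r:
        raise ValueError("Rand Index needs identical key and response mentions")
    # Agreeing within-entity links plus agreeing cross-entity links.
    return (len(C_k & C_r) + len(N_k & N_r)) / len(T_k)

# Hypothetical document: 20 mentions, only two key coreference links.
key = [{0, 1}, {2, 3}] + [{m} for m in range(4, 20)]
response = [{m} for m in range(20)]  # every mention left as a singleton

score = rand_index(key, response)
print(round(score, 3))  # 0.989
```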
BLANC-gold solves this problem by averaging the F-measure computed over coreference links and the F-measure over non-coreference links. Using the notations in Section 2, the recall, precision, and F-measure on coreference links are:
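$$R_c^{(g)} = \frac{|C_k \cap C_r|}{|C_k \cap C_r| + |C_k \cap N_r|} \tag{2}$$
$$P_c^{(g)} = \frac{|C_k \cap C_r|}{|C_r \cap C_k| + |C_r \cap N_k|} \tag{3}$$
$$F_c^{(g)} = \frac{2 R_c^{(g)} P_c^{(g)}}{R_c^{(g)} + P_c^{(g)}} \tag{4}$$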
Similarly, the recall, precision, and F-measure on non-coreference links are computed as:
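$$R_n^{(g)} = \frac{|N_k \cap N_r|}{|N_k \cap C_r| + |N_k \cap N_r|} \tag{5}$$
$$P_n^{(g)} = \frac{|N_k \cap N_r|}{|N_r \cap C_k| + |N_r \cap N_k|} \tag{6}$$
$$F_n^{(g)} = \frac{2 R_n^{(g)} P_n^{(g)}}{R_n^{(g)} + P_n^{(g)}} \tag{7}$$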
Finally, the BLANC-gold metric is the arithmetic average of $F_c^{(g)}$ and $F_n^{(g)}$:
$$\text{BLANC}^{(g)} = \frac{F_c^{(g)} + F_n^{(g)}}{2} \tag{8}$$
The superscript $g$ in these equations highlights the fact that they are meant for coreference systems with gold mentions.
Eqn. (8) indicates that BLANC-gold assigns equal weight to $F_c^{(g)}$, the F-measure from coreference links, and $F_n^{(g)}$, the F-measure from non-coreference links. This avoids the problem that $|N_k| \gg |C_k|$ and $|N_r| \gg |C_r|$, should the original Rand Index be used.
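The BLANC-gold computation can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code of the CoNLL scorer: entities are Python sets, the function names are hypothetical, and the denominators of the recall and precision terms are assumed to be non-zero.

```python
from itertools import combinations

def links(entities):
    """Return (coreference links, non-coreference links, all links)."""
    coref = {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}
    mentions = sorted(set().union(*entities))
    total = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, total - coref, total

def f_measure(recall, precision):
    """Harmonic mean, taken to be 0 when both inputs are 0."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0

def blanc_gold(key, response):
    C_k, N_k, T_k = links(key)
    C_r, N_r, T_r = links(response)
    if T_k != T_r:
        raise ValueError("BLANC-gold assumes identical key and response mentions")
    cc, cn = len(C_k & C_r), len(C_k & N_r)
    nc, nn = len(N_k & C_r), len(N_k & N_r)
    # Recall, precision and F on coreference and non-coreference links;
    # this sketch assumes every denominator below is non-zero.
    F_c = f_measure(cc / (cc + cn), cc / (cc + nc))
    F_n = f_measure(nn / (nc + nn), nn / (cn + nn))
    return (F_c + F_n) / 2  # equal weight for both link types

# Hypothetical toy input over the same five mentions on both sides.
key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]
score = blanc_gold(key, response)
print(round(score, 4))  # 0.5833
```

Here $F_c^{(g)} = 1/2$ and $F_n^{(g)} = 2/3$, so the average is $7/12$.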
Under the assumption that the key and response mention sets are identical (which implies that $T_k = T_r$), Equations (2) to (7) make sense. For example, $R_c^{(g)}$ is the ratio of the number of correct coreference links over the number of key coreference links; $P_c^{(g)}$ is the ratio of the number of correct coreference links over the number of response coreference links, and so on.
However, when response mentions are not identical to key mentions, a key coreference link may not appear in either $C_r$ or $N_r$, so Equations (2) to (7) cannot be applied directly to systems with imperfect mentions. For instance, if the key entities are {a,b,c} and {d,e}, and the response entities are {b,c} and {e,f,g}, then the key coreference link (a,b) is not seen on the response side; similarly, a response link may not appear on the key side either: (c,f) and (f,g) are not in the key in the above example.
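The example above can be checked by enumerating the links that have no counterpart on the other side. This is a small sketch with hypothetical helper names (`links`, `show`); a link is considered missing when it does not occur in the other side's full link set.

```python
from itertools import combinations

def links(entities):
    """Return (coreference links, non-coreference links, all links)."""
    coref = {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}
    mentions = sorted(set().union(*entities))
    total = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, total - coref, total

def show(link_set):
    """Render links as sorted strings such as 'ab'."""
    return sorted("".join(sorted(link)) for link in link_set)

key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"b", "c"}, {"e", "f", "g"}]
C_k, N_k, T_k = links(key)
C_r, N_r, T_r = links(response)

missing_key_coref = show(C_k - T_r)       # key coreference links unseen in the response
missing_key_noncoref = show(N_k - T_r)    # key non-coreference links unseen in the response
spurious_resp_coref = show(C_r - T_k)     # response coreference links unseen in the key
spurious_resp_noncoref = show(N_r - T_k)  # response non-coreference links unseen in the key

print(missing_key_coref)       # ['ab', 'ac', 'de']
print(missing_key_noncoref)    # ['ad', 'ae', 'bd', 'cd']
print(spurious_resp_coref)     # ['ef', 'eg', 'fg']
print(spurious_resp_noncoref)  # ['bf', 'bg', 'cf', 'cg']
```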
To account for missing or spurious links, we observe that
- $C_k - T_r$ are key coreference links missing in the response;
- $N_k - T_r$ are key non-coreference links missing in the response;
- $C_r - T_k$ are response coreference links missing in the key;
- $N_r - T_k$ are response non-coreference links missing in the key,
and we propose to extend the coreference F-measure and non-coreference F-measure as follows. Coreference recall, precision and F-measure are changed to:
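$$R_c = \frac{|C_k \cap C_r|}{|C_k \cap C_r| + |C_k \cap N_r| + |C_k - T_r|} \tag{9}$$
$$P_c = \frac{|C_k \cap C_r|}{|C_r \cap C_k| + |C_r \cap N_k| + |C_r - T_k|} \tag{10}$$
$$F_c = \frac{2 R_c P_c}{R_c + P_c} \tag{11}$$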
Non-coreference recall, precision and F-measure are changed to:
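$$R_n = \frac{|N_k \cap N_r|}{|N_k \cap C_r| + |N_k \cap N_r| + |N_k - T_r|} \tag{12}$$
$$P_n = \frac{|N_k \cap N_r|}{|N_r \cap C_k| + |N_r \cap N_k| + |N_r - T_k|} \tag{13}$$
$$F_n = \frac{2 R_n P_n}{R_n + P_n} \tag{14}$$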
The proposed BLANC continues to be the arithmetic average of $F_c$ and $F_n$:
$$\text{BLANC} = \frac{F_c + F_n}{2} \tag{15}$$
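We observe that the definition of the proposed BLANC, Eqns. (9) to (14), subsumes that of BLANC-gold, Eqns. (2) to (7), due to the following proposition:
Proposition. If $T_k = T_r$, then $\text{BLANC} = \text{BLANC}^{(g)}$.
Proof. We only need to show that $R_c = R_c^{(g)}$, $P_c = P_c^{(g)}$, $R_n = R_n^{(g)}$, and $P_n = P_n^{(g)}$. We prove the first one (the other proofs are similar and elided due to space limitations). Since $T_k = T_r$ and $C_k \subseteq T_k$, we have $C_k \subseteq T_r$; thus $C_k - T_r = \emptyset$, and $|C_k - T_r| = 0$. This establishes that $R_c = R_c^{(g)}$.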
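The argument in the proof can also be checked empirically. The sketch below is a randomized sanity check, not a proof: it draws random key and response partitions over the same seven hypothetical mentions and confirms that the "missing link" sets, which are the only terms that distinguish the extended recall and precision from their BLANC-gold counterparts, are always empty. The helper names and the seed are arbitrary choices.

```python
import random
from itertools import combinations

def links(entities):
    """Return (coreference links, non-coreference links, all links)."""
    coref = {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}
    mentions = sorted(set().union(*entities))
    total = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, total - coref, total

def random_partition(mentions, rng):
    """Assign each mention to one of up to three entities at random."""
    buckets = {}
    for m in mentions:
        buckets.setdefault(rng.randrange(3), set()).add(m)
    return list(buckets.values())

rng = random.Random(0)
mentions = list("abcdefg")
trials = 0
for _ in range(200):
    C_k, N_k, T_k = links(random_partition(mentions, rng))
    C_r, N_r, T_r = links(random_partition(mentions, rng))
    assert T_k == T_r  # same mentions on both sides: the gold-mention assumption
    # With T_k == T_r, no key link is missing in the response and vice versa.
    assert not (C_k - T_r) and not (N_k - T_r)
    assert not (C_r - T_k) and not (N_r - T_k)
    trials += 1
print(trials, "trials: extended and gold denominators coincide")
```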
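Indeed, since $C_k$ is a union of three disjoint subsets, $C_k = (C_k \cap C_r) \cup (C_k \cap N_r) \cup (C_k - T_r)$, $R_c^{(g)}$ and $R_c$ can be unified as $\frac{|C_k \cap C_r|}{|C_k|}$. Unification for the other component recalls and precisions can be done similarly. So the final definition of BLANC can be succinctly stated as:
$$R_c = \frac{|C_k \cap C_r|}{|C_k|}, \quad P_c = \frac{|C_k \cap C_r|}{|C_r|} \tag{16}$$
$$R_n = \frac{|N_k \cap N_r|}{|N_k|}, \quad P_n = \frac{|N_k \cap N_r|}{|N_r|} \tag{17}$$
$$F_c = \frac{2\,|C_k \cap C_r|}{|C_k| + |C_r|}, \quad F_n = \frac{2\,|N_k \cap N_r|}{|N_k| + |N_r|} \tag{18}$$
$$\text{BLANC} = \frac{F_c + F_n}{2} \tag{19}$$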
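For readers who want the intermediate step, the closed form of the coreference F-measure follows by expanding the harmonic mean of $R_c = |C_k \cap C_r| / |C_k|$ and $P_c = |C_k \cap C_r| / |C_r|$. The derivation below is added for illustration and assumes $|C_k \cap C_r| > 0$, so that $R_c + P_c > 0$; the same steps apply to $F_n$ with $N$ in place of $C$.

```latex
\begin{align*}
F_c = \frac{2 R_c P_c}{R_c + P_c}
    &= \frac{2\,\dfrac{|C_k \cap C_r|^2}{|C_k|\,|C_r|}}
            {|C_k \cap C_r|\,\dfrac{|C_k| + |C_r|}{|C_k|\,|C_r|}} \\
    &= \frac{2\,|C_k \cap C_r|}{|C_k| + |C_r|}.
\end{align*}
```

The final expression remains well-defined when $|C_k \cap C_r| = 0$, provided $|C_k| + |C_r| > 0$, and then evaluates to 0.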
Care has to be taken when counts in the BLANC definition are 0. This can happen when all key (or response) mentions are in one cluster or are all singletons: the former case will lead to $N_k = \emptyset$ (or $N_r = \emptyset$); the latter will lead to $C_k = \emptyset$ (or $C_r = \emptyset$). Observe that as long as $|C_k| + |C_r| > 0$, $F_c$ in (18) is well-defined; as long as $|N_k| + |N_r| > 0$, $F_n$ in (18) is well-defined. So we only need to augment the BLANC definition for the following cases:
(1) If $C_k = C_r = \emptyset$ and $N_k = N_r = \emptyset$, then $\text{BLANC} = I(M_k = M_r)$, where $I(\cdot)$ is an indicator function whose value is 1 if its argument is true, and 0 otherwise. $M_k$ and $M_r$ are the key and response mention sets. This can happen when a document has no more than one mention and there is no link.
(2) If $C_k = C_r = \emptyset$ and $|N_k| + |N_r| > 0$, then $\text{BLANC} = F_n$. This is the case where the key and response sides have only entities consisting of singleton mentions. Since there is no coreference link, BLANC reduces to the non-coreference F-measure $F_n$.
(3) If $N_k = N_r = \emptyset$ and $|C_k| + |C_r| > 0$, then $\text{BLANC} = F_c$. This is the case where all mentions in the key and response are in one entity. Since there is no non-coreference link, BLANC reduces to the coreference F-measure $F_c$.
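Putting the unified definition and the three boundary cases together gives a compact scoring routine. The following is a sketch under stated assumptions rather than the reference scorer's code: entities are Python sets of mention identifiers, the function names are hypothetical, and each F-measure is computed as twice the number of shared links divided by the sum of the two link counts.

```python
from itertools import combinations

def links(entities):
    """Return (coreference links, non-coreference links, mention set)."""
    coref = {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}
    mentions = set().union(*entities)
    total = {frozenset(p) for p in combinations(sorted(mentions), 2)}
    return coref, total - coref, mentions

def f_measure(key_links, resp_links):
    """2 * |shared links| / (|key links| + |response links|)."""
    return 2 * len(key_links & resp_links) / (len(key_links) + len(resp_links))

def blanc(key, response):
    C_k, N_k, M_k = links(key)
    C_r, N_r, M_r = links(response)
    has_coref = bool(C_k or C_r)
    has_noncoref = bool(N_k or N_r)
    if not has_coref and not has_noncoref:
        return float(M_k == M_r)    # case (1): no link on either side
    if not has_coref:
        return f_measure(N_k, N_r)  # case (2): singleton entities only
    if not has_noncoref:
        return f_measure(C_k, C_r)  # case (3): one entity on each side
    return (f_measure(C_k, C_r) + f_measure(N_k, N_r)) / 2

# System mentions differ from key mentions (a, d missing; f, g spurious).
key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"b", "c"}, {"e", "f", "g"}]
score = blanc(key, response)
print(round(score, 4))  # 0.2917
```

For this input $F_c = 2 \cdot 1 / (4 + 4) = 1/4$ and $F_n = 2 \cdot 2 / (6 + 6) = 1/3$, so the score is $7/24$; scoring a key against itself returns 1.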
We walk through a few examples and show how BLANC is calculated in detail. In all the examples below, each lower-case letter represents a mention; mentions in an entity are enclosed in {}; two letters in () represent a link.
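Example 1. Key entities are $\{abc\}$ and $\{d\}$; response entities are $\{bc\}$ and $\{de\}$. Obviously,
$C_k = \{(ab), (bc), (ac)\}$;
$N_k = \{(ad), (bd), (cd)\}$;
$C_r = \{(bc), (de)\}$;
$N_r = \{(bd), (be), (cd), (ce)\}$.
Therefore, $C_k \cap C_r = \{(bc)\}$, $N_k \cap N_r = \{(bd), (cd)\}$, and
$R_c = \frac{1}{3}$, $P_c = \frac{1}{2}$, $F_c = \frac{2}{5}$;
$R_n = \frac{2}{3}$, $P_n = \frac{2}{4}$, $F_n = \frac{4}{7}$. Finally,
$\text{BLANC} = \frac{17}{35}$.
Example 2. Key entity is $\{a\}$; response entity is $\{b\}$. This is boundary case (1): $\text{BLANC} = 0$.
Example 3. Key entities are $\{a\}\{b\}\{c\}$; response entities are $\{a\}\{b\}\{d\}$. This is boundary case (2): there are no coreference links.
Since
$N_k = \{(ab), (bc), (ca)\}$,
$N_r = \{(ab), (bd), (ad)\}$,
we have
$N_k \cap N_r = \{(ab)\}$, and $|N_k| = 3$, $|N_r| = 3$.
So $\text{BLANC} = \frac{2}{3+3} = \frac{1}{3}$.
Example 4. Key entity is $\{abc\}$; response entity is $\{bc\}$. This is boundary case (3): there are no non-coreference links.
Since
$C_k = \{(ab), (bc), (ca)\}$, and $C_r = \{(bc)\}$,
we have
$C_k \cap C_r = \{(bc)\}$, and $|C_k| = 3$, $|C_r| = 1$.
So $\text{BLANC} = \frac{2}{3+1} = \frac{1}{2}$.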
Table 1: Proposed BLANC scores of the CoNLL 2011 shared task participants.
Participant | R | P | BLANC |
---|---|---|---|
lee | 50.23 | 49.28 | 48.84 |
sapena | 40.68 | 49.05 | 44.47 |
nugues | 47.83 | 44.22 | 45.95 |
chang | 44.71 | 47.48 | 45.49 |
stoyanov | 49.37 | 29.80 | 34.58 |
santos | 46.74 | 37.33 | 41.33 |
song | 36.88 | 39.69 | 30.92 |
sobha | 35.42 | 39.56 | 36.31 |
yang | 47.95 | 29.12 | 36.09 |
charton | 42.32 | 31.54 | 35.65 |
hao | 45.41 | 32.75 | 36.98 |
zhou | 29.93 | 45.58 | 34.95 |
kobdani | 32.29 | 33.01 | 32.57 |
xinxin | 36.83 | 34.39 | 35.02 |
kummerfeld | 34.84 | 29.53 | 30.98 |
zhang | 30.10 | 43.96 | 35.71 |
zhekova | 26.40 | 15.32 | 15.37 |
irwin | 03.62 | 28.28 | 06.28 |
We have updated the publicly available CoNLL coreference scorer (http://code.google.com/p/reference-coreference-scorers) with the proposed BLANC, and used it to compute the proposed BLANC scores for all the CoNLL 2011 [5] and 2012 [4] participants in the official track, where participants had to automatically predict the mentions. Tables 1 and 2 report the updated results. (The order of participants is kept the same as in Pradhan et al. (2011) and Pradhan et al. (2012) for easy comparison.)
Table 2: Proposed BLANC scores of the CoNLL 2012 shared task participants.
Participant | R | P | BLANC |
---|---|---|---|
Language: Arabic | |||
fernandes | 33.43 | 44.66 | 37.99 |
bjorkelund | 32.65 | 45.47 | 37.93 |
uryupina | 31.62 | 35.26 | 33.02 |
stamborg | 32.59 | 36.92 | 34.50 |
chen | 31.81 | 31.52 | 30.82 |
zhekova | 11.04 | 62.58 | 18.51 |
li | 04.60 | 56.63 | 08.42 |
Language: English | |||
fernandes | 54.91 | 63.66 | 58.75 |
martschat | 52.00 | 58.84 | 55.04 |
bjorkelund | 52.01 | 59.55 | 55.42 |
chang | 52.85 | 55.03 | 53.86 |
chen | 50.52 | 56.82 | 52.87 |
chunyang | 51.19 | 55.47 | 52.65 |
stamborg | 54.39 | 54.88 | 54.42 |
yuan | 50.58 | 54.29 | 52.11 |
xu | 45.99 | 54.59 | 46.47 |
shou | 49.55 | 52.46 | 50.44 |
uryupina | 44.15 | 48.89 | 46.04 |
songyang | 40.60 | 50.85 | 45.10 |
zhekova | 41.46 | 33.13 | 34.80 |
xinxin | 44.39 | 32.79 | 36.54 |
li | 25.17 | 52.96 | 31.85 |
Language: Chinese | |||
chen | 48.45 | 62.44 | 54.10 |
yuan | 53.15 | 40.75 | 43.20 |
bjorkelund | 47.58 | 45.93 | 44.22 |
xu | 44.11 | 36.45 | 38.45 |
fernandes | 42.36 | 61.72 | 49.63 |
stamborg | 39.60 | 55.12 | 45.89 |
uryupina | 33.44 | 56.01 | 41.88 |
martschat | 27.24 | 62.33 | 37.89 |
chunyang | 37.43 | 36.18 | 36.77 |
xinxin | 36.46 | 39.79 | 37.85 |
li | 21.61 | 62.94 | 30.37 |
chang | 18.74 | 40.76 | 25.68 |
zhekova | 21.50 | 37.18 | 22.89 |
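Table 3: Correlation between the proposed BLANC and existing metrics on the CoNLL 2011 and 2012 results.
Metric | R | P | F1 |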
---|---|---|---|
MUC | 0.975 | 0.844 | 0.935 |
B-cubed | 0.981 | 0.942 | 0.966 |
CEAF-m | 0.941 | 0.923 | 0.966 |
CEAF-e | 0.797 | 0.781 | 0.919 |
Figure 1 shows how the proposed BLANC measure works when compared with existing metrics such as MUC, B-cubed and CEAF, using the BLANC and F1 scores. The proposed BLANC is highly positively correlated with the other measures along R, P and F1 (Table 3), showing that BLANC is able to capture most entity-based similarities measured by B-cubed and CEAF. However, the CoNLL data sets come from OntoNotes [2], where singleton entities are not annotated, and BLANC has a wider dynamic range on data sets with singletons [7]. So the correlations will likely be lower on data sets with singleton entities.
The original BLANC-gold [7] requires that system mentions be identical to gold mentions, which limits the metric’s utility since detected system mentions often have missing key mentions or spurious mentions. The proposed BLANC is free from this assumption, and we have shown that it subsumes the original BLANC-gold. Since BLANC works on imperfect system mentions, we have used it to score the CoNLL 2011 and 2012 coreference systems. The BLANC scores show strong correlation with existing metrics, especially B-cubed and CEAF-m.
We would like to thank the three anonymous reviewers for their invaluable suggestions for improving the paper. This work was partially supported by grant R01LM10090 from the National Library of Medicine.