An Extension of BLANC to System Mentions

Xiaoqiang Luo
Google Inc.
111 8th Ave, New York, NY 10011
xql@google.com
Sameer Pradhan
Harvard Medical School
300 Longwood Ave., Boston, MA 02115
sameer.pradhan@childrens.harvard.edu
Marta Recasens
Google Inc.
1600 Amphitheatre Pkwy,
Mountain View, CA 94043
recasens@google.com
Eduard Hovy
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh, PA 15213
hovy@cmu.edu
Abstract

BLANC is a link-based coreference evaluation metric for measuring the quality of coreference systems on gold mentions. This paper extends the original BLANC (“BLANC-gold” henceforth) to system mentions, removing the gold mention assumption. The proposed BLANC falls back seamlessly to the original one if system mentions are identical to gold mentions, and it is shown to strongly correlate with existing metrics on the 2011 and 2012 CoNLL data.

1 Introduction

Coreference resolution aims at identifying natural language expressions (or mentions) that refer to the same entity. It entails partitioning (often imperfect) mentions into equivalence classes. A critically important problem is how to measure the quality of a coreference resolution system. Many evaluation metrics have been proposed in the past two decades, including the MUC measure [9], B-cubed [1], CEAF [3] and, more recently, BLANC-gold [7]. B-cubed and CEAF treat entities as sets of mentions and measure the agreement between key (or gold standard) entities and response (or system-generated) entities, while MUC and BLANC-gold are link-based.

In particular, MUC measures the degree of agreement between key coreference links (i.e., links among mentions within entities) and response coreference links, while non-coreference links (i.e., links formed by mentions from different entities) are not explicitly taken into account. This leads to a phenomenon where coreference systems outputting large entities are scored more favorably than those outputting small entities [3]. BLANC [7], on the other hand, considers both coreference links and non-coreference links. It calculates recall, precision and F-measure separately on coreference and non-coreference links in the usual way, and defines the overall recall, precision and F-measure as the mean of the respective measures for coreference and non-coreference links.

The BLANC-gold metric was developed with the assumption that response mentions and key mentions are identical. In reality, however, mentions need to be detected from natural language text and the result is, more often than not, imperfect: some key mentions may be missing in the response, and some response mentions may be spurious—so-called “twinless” mentions by Stoyanov et al. (2009). Therefore, the identical-mention-set assumption limits BLANC-gold’s applicability when gold mentions are not available, or when one wants to have a single score measuring both the quality of mention detection and coreference resolution. The goal of this paper is to extend the BLANC-gold metric to imperfect response mentions.

We first briefly review the original definition of BLANC, and rewrite its definition using set notation. We then argue that the gold-mention assumption in Recasens and Hovy (2011) can be lifted without changing the original definition. In fact, the proposed BLANC metric subsumes the original one in that its value is identical to the original one when response mentions are identical to key mentions.

The rest of the paper is organized as follows. We introduce the notation used in this paper in Section 2. We then present the original BLANC-gold in Section 3 using the set notation defined in Section 2. This paves the way for generalizing it to imperfect system mentions, which is presented in Section 4. The proposed BLANC is applied to the CoNLL 2011 and 2012 shared task participants, and the scores and their correlations with existing metrics are shown in Section 5.

2 Notations

To facilitate the presentation, we define the notations used in the paper.

We use key to refer to gold standard mentions or entities, and response to refer to system mentions or entities. The collection of key entities is denoted by $K=\{k_i\}_{i=1}^{|K|}$, where $k_i$ is the $i$th key entity; accordingly, $R=\{r_j\}_{j=1}^{|R|}$ is the set of response entities, and $r_j$ is the $j$th response entity. We assume that mentions in $\{k_i\}$ and $\{r_j\}$ are unique; in other words, there is no duplicate mention.

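Let $C_k(i)$ and $C_r(j)$ be the sets of coreference links formed by mentions in $k_i$ and $r_j$, respectively:

$$C_k(i) = \{(m_1,m_2) : m_1 \in k_i,\ m_2 \in k_i,\ m_1 \neq m_2\}$$
$$C_r(j) = \{(m_1,m_2) : m_1 \in r_j,\ m_2 \in r_j,\ m_1 \neq m_2\}$$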

As can be seen, a link is an undirected edge between two mentions, and it can be equivalently represented by a pair of mentions. Note that when an entity consists of a single mention, its coreference link set is empty.

Let $N_k(i,j)$ ($i \neq j$) be key non-coreference links formed between mentions in $k_i$ and those in $k_j$, and let $N_r(i,j)$ ($i \neq j$) be response non-coreference links formed between mentions in $r_i$ and those in $r_j$, respectively:

$$N_k(i,j) = \{(m_1,m_2) : m_1 \in k_i,\ m_2 \in k_j\}$$
$$N_r(i,j) = \{(m_1,m_2) : m_1 \in r_i,\ m_2 \in r_j\}$$

Note that the non-coreference link set is empty when all mentions are in the same entity.

We use the same letter and subscript without the index in parentheses to denote the union of sets, e.g.,

$$C_k = \bigcup_{i} C_k(i), \quad N_k = \bigcup_{i \neq j} N_k(i,j)$$
$$C_r = \bigcup_{j} C_r(j), \quad N_r = \bigcup_{i \neq j} N_r(i,j)$$

We use $T_k = C_k \cup N_k$ and $T_r = C_r \cup N_r$ to denote the total set of key links and the total set of response links, respectively. Clearly, $C_k$ and $N_k$ form a partition of $T_k$, since $C_k \cap N_k = \emptyset$ and $T_k = C_k \cup N_k$. Likewise, $C_r$ and $N_r$ form a partition of $T_r$.

We say that a key link $l_1 \in T_k$ equals a response link $l_2 \in T_r$ if and only if the pairs of mentions from which the links are formed are identical. We write $l_1 = l_2$ if two links are equal. It is easy to see that the gold mention assumption (the set of response mentions is the same as the set of key mentions) can be equivalently stated as $T_k = T_r$ (this does not necessarily mean that $C_k = C_r$ or $N_k = N_r$).

We also use $|\cdot|$ to denote the size of a set.
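The notation of this section can be made concrete with a short Python sketch. The helper name `link_sets` and the representation of a link as a two-element `frozenset` are assumptions of this illustration, not part of the paper or of the CoNLL scorer; the non-coreference links are obtained as all mention pairs minus the coreference links, which is valid because each mention belongs to exactly one entity.

```python
from itertools import combinations

def link_sets(entities):
    """Return (C, N): coreference and non-coreference links of a partition.

    Each link is a frozenset of two mentions, so links are undirected.
    """
    coref = set()
    for entity in entities:
        # Every pair of distinct mentions inside one entity is a coreference link.
        coref.update(frozenset(p) for p in combinations(sorted(entity), 2))
    mentions = sorted(m for entity in entities for m in entity)
    every = {frozenset(p) for p in combinations(mentions, 2)}
    # A pair that is not within one entity spans two entities.
    return coref, every - coref

key = [{"a", "b", "c"}, {"d"}]
C_k, N_k = link_sets(key)
T_k = C_k | N_k
print(len(C_k), len(N_k), len(T_k))  # 3 3 6
```

With this representation, the union, intersection and difference of link sets used in the following sections map directly onto Python's `|`, `&` and `-` set operators.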

3 Original BLANC

BLANC-gold is adapted from Rand Index [6], a metric for clustering objects. Rand Index is defined as the ratio between the number of correct within-cluster links plus the number of correct cross-cluster links, and the total number of links.

When $T_k = T_r$, Rand Index can be applied directly since coreference resolution reduces to a clustering problem where mentions are partitioned into clusters (entities):

$$\text{Rand Index} = \frac{|C_k \cap C_r| + |N_k \cap N_r|}{\frac{1}{2}\left(|T_k|\,(|T_k|-1)\right)} \tag{1}$$

In practice, though, the simple-minded adoption of Rand Index is not satisfactory since the number of non-coreference links often overwhelms that of coreference links [7], that is, $|N_k| \gg |C_k|$ and $|N_r| \gg |C_r|$. Rand Index, if used without modification, would not be sensitive to changes of coreference links.
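A small numeric sketch may help to see the insensitivity. It follows the prose definition of Rand Index (correct links divided by the total number of links) and takes the total number of links among n mentions to be n(n-1)/2; the 100-mention document and the function name `rand_index` are hypothetical and chosen only for illustration.

```python
def rand_index(n_mentions, correct_coref, correct_noncoref):
    """Correct links divided by all n(n-1)/2 links among n mentions."""
    total_links = n_mentions * (n_mentions - 1) // 2
    return (correct_coref + correct_noncoref) / total_links

# Hypothetical document: 100 gold mentions; the key has one two-mention
# entity and 98 singletons, so it has 1 coreference link and 4949
# non-coreference links.
n = 100
perfect = rand_index(n, correct_coref=1, correct_noncoref=4949)

# A response that leaves every mention a singleton misses the only
# coreference link but still gets every key non-coreference link right.
all_singletons = rand_index(n, correct_coref=0, correct_noncoref=4949)

print(perfect, round(all_singletons, 4))  # 1.0 0.9998
```

Under these assumptions, a system that finds no coreference at all loses only about 0.0002 of Rand Index, which is the behavior the averaging in BLANC-gold is meant to avoid.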

BLANC-gold solves this problem by averaging the F-measure computed over coreference links and the F-measure over non-coreference links. Using the notations in Section 2, the recall, precision, and F-measure on coreference links are:

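$$R_c^{(g)} = \frac{|C_k \cap C_r|}{|C_k \cap C_r| + |C_k \cap N_r|} \tag{2}$$
$$P_c^{(g)} = \frac{|C_k \cap C_r|}{|C_r \cap C_k| + |C_r \cap N_k|} \tag{3}$$
$$F_c^{(g)} = \frac{2 R_c^{(g)} P_c^{(g)}}{R_c^{(g)} + P_c^{(g)}}; \tag{4}$$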

Similarly, the recall, precision, and F-measure on non-coreference links are computed as:

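$$R_n^{(g)} = \frac{|N_k \cap N_r|}{|N_k \cap C_r| + |N_k \cap N_r|} \tag{5}$$
$$P_n^{(g)} = \frac{|N_k \cap N_r|}{|N_r \cap C_k| + |N_r \cap N_k|} \tag{6}$$
$$F_n^{(g)} = \frac{2 R_n^{(g)} P_n^{(g)}}{R_n^{(g)} + P_n^{(g)}}. \tag{7}$$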

Finally, the BLANC-gold metric is the arithmetic average of $F_c^{(g)}$ and $F_n^{(g)}$:

$$\mathrm{BLANC}^{(g)} = \frac{F_c^{(g)} + F_n^{(g)}}{2}. \tag{8}$$

Superscript g in these equations highlights the fact that they are meant for coreference systems with gold mentions.

Eqn. (8) indicates that BLANC-gold assigns equal weight to $F_c^{(g)}$, the F-measure from coreference links, and $F_n^{(g)}$, the F-measure from non-coreference links. This avoids the problem that $|N_k| \gg |C_k|$ and $|N_r| \gg |C_r|$, should the original Rand Index be used.

In Eqns. (2)-(3) and Eqns. (5)-(6), denominators are written as sums over disjoint subsets so they can be related to the contingency table in [7]. Under the assumption that $T_k = T_r$, it is clear that $C_k = (C_k \cap C_r) \cup (C_k \cap N_r)$, $C_r = (C_k \cap C_r) \cup (N_k \cap C_r)$, and so on.
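As a rough illustration of Eqns. (2) to (8), the sketch below computes BLANC-gold from the four contingency-table cells. The function names, the `frozenset` link representation and the example entities are assumptions of this sketch rather than the authors' implementation, and it does not handle documents in which a denominator of Eqns. (2), (3), (5) or (6) is zero.

```python
from itertools import combinations

def link_sets(entities):
    """Return (coreference links, non-coreference links) as sets of frozensets."""
    coref = {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}
    mentions = sorted(m for e in entities for m in e)
    every = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, every - coref

def f_measure(recall, precision):
    # Returning 0 when both inputs are 0 is a convenience of this sketch.
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0

def blanc_gold(key, response):
    C_k, N_k = link_sets(key)
    C_r, N_r = link_sets(response)
    if C_k | N_k != C_r | N_r:
        raise ValueError("BLANC-gold assumes identical key and response mentions")
    cc, cn = len(C_k & C_r), len(C_k & N_r)    # key coreference row
    nc, nn = len(N_k & C_r), len(N_k & N_r)    # key non-coreference row
    r_c, p_c = cc / (cc + cn), cc / (cc + nc)  # Eqns. (2), (3)
    r_n, p_n = nn / (nc + nn), nn / (cn + nn)  # Eqns. (5), (6)
    return (f_measure(r_c, p_c) + f_measure(r_n, p_n)) / 2  # Eqns. (4), (7), (8)

# Hypothetical gold-mention example: same four mentions on both sides.
score = blanc_gold([{"a", "b", "c"}, {"d"}], [{"a", "b"}, {"c", "d"}])
print(round(score, 4))  # 0.4857
```

Here the cells are 1, 2, 1 and 2, giving coreference and non-coreference F-measures of 2/5 and 4/7.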

4 BLANC for Imperfect Response Mentions

Under the assumption that the key and response mention sets are identical (which implies that $T_k = T_r$), Equations (2) to (7) make sense. For example, $R_c$ is the ratio of the number of correct coreference links over the number of key coreference links; $P_c$ is the ratio of the number of correct coreference links over the number of response coreference links, and so on.

However, when response mentions are not identical to key mentions, a key coreference link may not appear in either $C_r$ or $N_r$, so Equations (2) to (7) cannot be applied directly to systems with imperfect mentions. For instance, if the key entities are {a,b,c} and {d,e}, and the response entities are {b,c} and {e,f,g}, then the key coreference link (a,b) is not seen on the response side. Similarly, a response link may not appear on the key side: (c,f) and (f,g) are not in the key in the above example.
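The example can be checked mechanically. The following sketch (helper names and the `frozenset` link representation are assumptions of this illustration) lists the key coreference links that are absent from all response links, and the response links that are absent from all key links.

```python
from itertools import combinations

def coref_links(entities):
    """Links between mentions of the same entity."""
    return {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}

def total_links(entities):
    """All links among the mentions of a partition, coreferent or not."""
    mentions = sorted(m for e in entities for m in e)
    return {frozenset(p) for p in combinations(mentions, 2)}

def show(links):
    return sorted("".join(sorted(link)) for link in links)

key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"b", "c"}, {"e", "f", "g"}]
C_k, T_k = coref_links(key), total_links(key)
C_r, T_r = coref_links(response), total_links(response)

missing = show(C_k - T_r)    # key coreference links unseen in the response
spurious = show(T_r - T_k)   # response links unseen in the key
print(missing)   # ['ab', 'ac', 'de']
print(spurious)  # ['bf', 'bg', 'cf', 'cg', 'ef', 'eg', 'fg']
```

The link (a,b) shows up in the first list, and (c,f) and (f,g) in the second, matching the discussion above.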

To account for missing or spurious links, we observe that
  • $C_k \setminus T_r$ are key coreference links missing in the response;
  • $N_k \setminus T_r$ are key non-coreference links missing in the response;
  • $C_r \setminus T_k$ are response coreference links missing in the key;
  • $N_r \setminus T_k$ are response non-coreference links missing in the key,
and we propose to extend the coreference F-measure and non-coreference F-measure as follows. Coreference recall, precision and F-measure are changed to:

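$$R_c = \frac{|C_k \cap C_r|}{|C_k \cap C_r| + |C_k \cap N_r| + |C_k \setminus T_r|} \tag{9}$$
$$P_c = \frac{|C_k \cap C_r|}{|C_r \cap C_k| + |C_r \cap N_k| + |C_r \setminus T_k|} \tag{10}$$
$$F_c = \frac{2 R_c P_c}{R_c + P_c} \tag{11}$$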

Non-coreference recall, precision and F-measure are changed to:

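$$R_n = \frac{|N_k \cap N_r|}{|N_k \cap C_r| + |N_k \cap N_r| + |N_k \setminus T_r|} \tag{12}$$
$$P_n = \frac{|N_k \cap N_r|}{|N_r \cap C_k| + |N_r \cap N_k| + |N_r \setminus T_k|} \tag{13}$$
$$F_n = \frac{2 R_n P_n}{R_n + P_n}. \tag{14}$$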

The proposed BLANC continues to be the arithmetic average of $F_c$ and $F_n$:

$$\mathrm{BLANC} = \frac{F_c + F_n}{2}. \tag{15}$$

We observe that the proposed BLANC, Eqns. (9) to (14), subsumes BLANC-gold, Eqns. (2) to (7), due to the following proposition:
Proposition. If $T_k = T_r$, then $\mathrm{BLANC} = \mathrm{BLANC}^{(g)}$.

Proof. We only need to show that $R_c = R_c^{(g)}$, $P_c = P_c^{(g)}$, $R_n = R_n^{(g)}$, and $P_n = P_n^{(g)}$. We prove the first one (the other proofs are similar and elided due to space limitations). Since $T_k = T_r$ and $C_k \subseteq T_k$, we have $C_k \subseteq T_r$; thus $C_k \setminus T_r = \emptyset$, and $|C_k \setminus T_r| = 0$. This establishes that $R_c = R_c^{(g)}$.

Indeed, since $C_k$ is a union of three disjoint subsets, $C_k = (C_k \cap C_r) \cup (C_k \cap N_r) \cup (C_k \setminus T_r)$, both $R_c^{(g)}$ and $R_c$ can be unified as $\frac{|C_k \cap C_r|}{|C_k|}$. Unification for the other component recalls and precisions can be done similarly. So the final definition of BLANC can be succinctly stated as:

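$$R_c = \frac{|C_k \cap C_r|}{|C_k|}, \quad P_c = \frac{|C_k \cap C_r|}{|C_r|} \tag{16}$$
$$R_n = \frac{|N_k \cap N_r|}{|N_k|}, \quad P_n = \frac{|N_k \cap N_r|}{|N_r|} \tag{17}$$
$$F_c = \frac{2\,|C_k \cap C_r|}{|C_k| + |C_r|}, \quad F_n = \frac{2\,|N_k \cap N_r|}{|N_k| + |N_r|} \tag{18}$$
$$\mathrm{BLANC} = \frac{F_c + F_n}{2} \tag{19}$$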
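For completeness, the step from the recall and precision in (16) to the closed form of the coreference F-measure in (18) can be spelled out as follows. This is a routine derivation added for illustration; it assumes that $C_k$, $C_r$ and their intersection are all nonempty, and the non-coreference case is identical with $N$ in place of $C$.

```latex
\begin{align*}
F_c = \frac{2 R_c P_c}{R_c + P_c}
  &= \frac{2\,\dfrac{|C_k \cap C_r|}{|C_k|}\cdot\dfrac{|C_k \cap C_r|}{|C_r|}}
          {\dfrac{|C_k \cap C_r|}{|C_k|} + \dfrac{|C_k \cap C_r|}{|C_r|}} \\
  &= \frac{2\,|C_k \cap C_r|^2}{|C_k \cap C_r|\,\bigl(|C_r| + |C_k|\bigr)}
   = \frac{2\,|C_k \cap C_r|}{|C_k| + |C_r|}.
\end{align*}
```

The second line is obtained by multiplying the numerator and the denominator by $|C_k|\,|C_r|$.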

4.1 Boundary Cases

Care has to be taken when counts in the BLANC definition are 0. This can happen when all key (or response) mentions are in one cluster or are all singletons: the former case leads to $N_k = \emptyset$ (or $N_r = \emptyset$); the latter leads to $C_k = \emptyset$ (or $C_r = \emptyset$). Observe that as long as $|C_k| + |C_r| > 0$, $F_c$ in (18) is well-defined; as long as $|N_k| + |N_r| > 0$, $F_n$ in (18) is well-defined. So we only need to augment the BLANC definition for the following cases:

(1) If $C_k = C_r = \emptyset$ and $N_k = N_r = \emptyset$, then $\mathrm{BLANC} = I(M_k = M_r)$, where $I(\cdot)$ is an indicator function whose value is 1 if its argument is true, and 0 otherwise, and $M_k$ and $M_r$ are the key and response mention sets. This can happen when a document has no more than one mention and there is no link.

(2) If $C_k = C_r = \emptyset$ and $|N_k| + |N_r| > 0$, then $\mathrm{BLANC} = F_n$. This is the case where the key and response sides have only entities consisting of singleton mentions. Since there is no coreference link, BLANC reduces to the non-coreference F-measure $F_n$.

(3) If $N_k = N_r = \emptyset$ and $|C_k| + |C_r| > 0$, then $\mathrm{BLANC} = F_c$. This is the case where all mentions in the key and response are in one entity. Since there is no non-coreference link, BLANC reduces to the coreference F-measure $F_c$.
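Putting Eqns. (16) to (19) and the three boundary cases together, a minimal Python sketch of the proposed BLANC might look as follows. It is not the code of the CoNLL scorer: the function names, the `frozenset` link representation and the use of exact `Fraction` arithmetic are choices made for this illustration only.

```python
from fractions import Fraction
from itertools import combinations

def link_sets(entities):
    """Return (C, N, M): coreference links, non-coreference links, mentions."""
    coref = {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}
    mentions = sorted(m for e in entities for m in e)
    every = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, every - coref, set(mentions)

def f1(key_links, resp_links):
    """Eqn. (18): 2 |key ∩ response| / (|key| + |response|)."""
    return Fraction(2 * len(key_links & resp_links),
                    len(key_links) + len(resp_links))

def blanc(key, response):
    C_k, N_k, M_k = link_sets(key)
    C_r, N_r, M_r = link_sets(response)
    has_coref = len(C_k) + len(C_r) > 0
    has_noncoref = len(N_k) + len(N_r) > 0
    if not has_coref and not has_noncoref:   # boundary case (1)
        return Fraction(int(M_k == M_r))
    if not has_coref:                        # boundary case (2)
        return f1(N_k, N_r)
    if not has_noncoref:                     # boundary case (3)
        return f1(C_k, C_r)
    return (f1(C_k, C_r) + f1(N_k, N_r)) / 2  # Eqn. (19)

# Key {a,b,c} {d,e} against response {b,c} {e,f,g}.
score = blanc([{"a", "b", "c"}, {"d", "e"}], [{"b", "c"}, {"e", "f", "g"}])
print(score, float(score))
```

For this pair of partitions, each side has 4 coreference links with 1 in common, and 6 non-coreference links with 2 in common, so the coreference and non-coreference F-measures are 1/4 and 1/3 and the score is 7/24.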

4.2 Toy Examples

We walk through a few examples and show how BLANC is calculated in detail. In all the examples below, each lower-case letter represents a mention; mentions in an entity are enclosed in {}; two letters in () represent a link.

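Example 1. Key entities are {abc} and {d}; response entities are {bc} and {de}. Obviously,
    $C_k = \{(ab),(bc),(ac)\}$;
    $N_k = \{(ad),(bd),(cd)\}$;
    $C_r = \{(bc),(de)\}$;
    $N_r = \{(bd),(be),(cd),(ce)\}$.
Therefore, $C_k \cap C_r = \{(bc)\}$, $N_k \cap N_r = \{(bd),(cd)\}$, and $R_c = \frac{1}{3}$, $P_c = \frac{1}{2}$, $F_c = \frac{2}{5}$; $R_n = \frac{2}{3}$, $P_n = \frac{2}{4}$, $F_n = \frac{4}{7}$. Finally, $\mathrm{BLANC} = \frac{17}{35}$.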
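The arithmetic of Example 1 can be written out using the closed forms (18) and (19), with $|C_k| = 3$, $|C_r| = 2$, $|N_k| = 3$ and $|N_r| = 4$:

```latex
\begin{align*}
F_c &= \frac{2\,|C_k \cap C_r|}{|C_k| + |C_r|} = \frac{2 \cdot 1}{3 + 2} = \frac{2}{5}, \\
F_n &= \frac{2\,|N_k \cap N_r|}{|N_k| + |N_r|} = \frac{2 \cdot 2}{3 + 4} = \frac{4}{7}, \\
\mathrm{BLANC} &= \frac{1}{2}\left(\frac{2}{5} + \frac{4}{7}\right)
               = \frac{1}{2} \cdot \frac{14 + 20}{35} = \frac{17}{35} \approx 0.486.
\end{align*}
```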

Example 2. Key entity is {a}; response entity is {b}. This is boundary case (1): BLANC=0.

Example 3. Key entities are {a}{b}{c}; response entities are {a}{b}{d}. This is boundary case (2): there are no coreference links. Since
    $N_k = \{(ab),(bc),(ca)\}$,
    $N_r = \{(ab),(bd),(ad)\}$,
we have
    $N_k \cap N_r = \{(ab)\}$, and $R_n = \frac{1}{3}$, $P_n = \frac{1}{3}$.
So $\mathrm{BLANC} = F_n = \frac{1}{3}$.

Example 4. Key entity is {abc}; response entity is {bc}. This is boundary case (3): there are no non-coreference links. Since
    $C_k = \{(ab),(bc),(ca)\}$, and $C_r = \{(bc)\}$,
we have
    $C_k \cap C_r = \{(bc)\}$, and $R_c = \frac{1}{3}$, $P_c = 1$.
So $\mathrm{BLANC} = F_c = \frac{2}{4} = \frac{1}{2}$.

5 Results

Participant R P BLANC
lee 50.23 49.28 48.84
sapena 40.68 49.05 44.47
nugues 47.83 44.22 45.95
chang 44.71 47.48 45.49
stoyanov 49.37 29.80 34.58
santos 46.74 37.33 41.33
song 36.88 39.69 30.92
sobha 35.42 39.56 36.31
yang 47.95 29.12 36.09
charton 42.32 31.54 35.65
hao 45.41 32.75 36.98
zhou 29.93 45.58 34.95
kobdani 32.29 33.01 32.57
xinxin 36.83 34.39 35.02
kummerfeld 34.84 29.53 30.98
zhang 30.10 43.96 35.71
zhekova 26.40 15.32 15.37
irwin 03.62 28.28 06.28
Table 1: The proposed BLANC scores of the CoNLL-2011 shared task participants.

5.1 CoNLL-2011/12

We have updated the publicly available CoNLL coreference scorer (http://code.google.com/p/reference-coreference-scorers) with the proposed BLANC, and used it to compute the proposed BLANC scores for all the CoNLL 2011 [5] and 2012 [4] participants in the official track, where participants had to automatically predict the mentions. Tables 1 and 2 report the updated results; the order of participants is kept the same as in Pradhan et al. (2011) and Pradhan et al. (2012) for easy comparison.

Participant R P BLANC
Language: Arabic
fernandes 33.43 44.66 37.99
bjorkelund 32.65 45.47 37.93
uryupina 31.62 35.26 33.02
stamborg 32.59 36.92 34.50
chen 31.81 31.52 30.82
zhekova 11.04 62.58 18.51
li 04.60 56.63 08.42
Language: English
fernandes 54.91 63.66 58.75
martschat 52.00 58.84 55.04
bjorkelund 52.01 59.55 55.42
chang 52.85 55.03 53.86
chen 50.52 56.82 52.87
chunyang 51.19 55.47 52.65
stamborg 54.39 54.88 54.42
yuan 50.58 54.29 52.11
xu 45.99 54.59 46.47
shou 49.55 52.46 50.44
uryupina 44.15 48.89 46.04
songyang 40.60 50.85 45.10
zhekova 41.46 33.13 34.80
xinxin 44.39 32.79 36.54
li 25.17 52.96 31.85
Language: Chinese
chen 48.45 62.44 54.10
yuan 53.15 40.75 43.20
bjorkelund 47.58 45.93 44.22
xu 44.11 36.45 38.45
fernandes 42.36 61.72 49.63
stamborg 39.60 55.12 45.89
uryupina 33.44 56.01 41.88
martschat 27.24 62.33 37.89
chunyang 37.43 36.18 36.77
xinxin 36.46 39.79 37.85
li 21.61 62.94 30.37
chang 18.74 40.76 25.68
zhekova 21.50 37.18 22.89
Table 2: The proposed BLANC scores of the CoNLL-2012 shared task participants.

5.2 Correlation with Other Measures

Measure R P F1
MUC 0.975 0.844 0.935
B-cubed 0.981 0.942 0.966
CEAF-m 0.941 0.923 0.966
CEAF-e 0.797 0.781 0.919
Table 3: Pearson’s r correlation coefficients between the proposed BLANC and the other coreference measures based on the CoNLL 2011/2012 results. All p-values are significant at < 0.001.

Figure 1 shows how the proposed BLANC measure works when compared with existing metrics such as MUC, B-cubed and CEAF, using the BLANC and F1 scores. The proposed BLANC is highly positively correlated with the other measures along R, P and F1 (Table 3), showing that BLANC is able to capture most entity-based similarities measured by B-cubed and CEAF. However, the CoNLL data sets come from OntoNotes [2], where singleton entities are not annotated, and BLANC has a wider dynamic range on data sets with singletons [7]. So the correlations will likely be lower on data sets with singleton entities.

Figure 1: Correlation plot between the proposed BLANC and the other measures based on the CoNLL 2011/2012 results. All values are F1 scores.

6 Conclusion

The original BLANC-gold [7] requires that system mentions be identical to gold mentions, which limits the metric’s utility since detected system mentions often have missing key mentions or spurious mentions. The proposed BLANC is free from this assumption, and we have shown that it subsumes the original BLANC-gold. Since BLANC works on imperfect system mentions, we have used it to score the CoNLL 2011 and 2012 coreference systems. The BLANC scores show strong correlation with existing metrics, especially B-cubed and CEAF-m.

Acknowledgments

We would like to thank the three anonymous reviewers for their invaluable suggestions for improving the paper. This work was partially supported by grant R01LM10090 from the National Library of Medicine.

References

  • [1] A. Bagga and B. Baldwin (1998) Algorithms for scoring coreference chains. pp. 563–566.
  • [2] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw and R. Weischedel (2006-06) OntoNotes: the 90% solution. New York City, USA, pp. 57–60.
  • [3] X. Luo (2005) On coreference resolution performance metrics.
  • [4] S. Pradhan, A. Moschitti, N. Xue, O. Uryupina and Y. Zhang (2012-07) CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. Jeju Island, Korea, pp. 1–40.
  • [5] S. Pradhan, L. Ramshaw, M. Marcus, M. Palmer, R. Weischedel and N. Xue (2011-06) CoNLL-2011 shared task: modeling unrestricted coreference in OntoNotes. Portland, Oregon, USA, pp. 1–27.
  • [6] W. M. Rand (1971) Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 (336), pp. 846–850.
  • [7] M. Recasens and E. Hovy (2011-10) BLANC: implementing the Rand index for coreference evaluation. Natural Language Engineering 17, pp. 485–510. ISSN 1469-8110.
  • [8] V. Stoyanov, N. Gilbert, C. Cardie and E. Riloff (2009) Conundrums in noun phrase coreference resolution: making sense of the state-of-the-art. ACL ’09, Stroudsburg, PA, USA, pp. 656–664. ISBN 978-1-932432-46-6.
  • [9] M. Vilain, J. Burger, J. Aberdeen, D. Connolly and L. Hirschman (1995) A model-theoretic coreference scoring scheme. pp. 45–52.