Following the work of \citeN{Carletta96} and \citeN{Art:Poe08}, there is an increasing consensus in the field that in order to properly gauge the reliability of an annotation effort, chance-corrected measures of inter-annotator agreement should be used. With this in mind, it is striking that virtually all evaluations of syntactic annotation efforts use uncorrected parser evaluation metrics such as bracket scores (for phrase structure) and accuracy scores (for dependencies).
In this work we present a chance-corrected metric based on Krippendorff's $\alpha$, adapted to the structure of syntactic annotations and applicable both to phrase structure and dependency annotation without any modifications. To evaluate our metric we first present a number of synthetic experiments, in order to better control the sources of noise and gauge the metric's responses, before finally contrasting the behaviour of our chance-corrected metric with that of uncorrected parser evaluation metrics on real corpora.\footnote{The code used to produce the data in this paper, and some of the datasets used, are available to download at https://github.com/arnsholt/syn-agreement/}
It is a truth universally acknowledged that an annotation task in good standing be in possession of a measure of inter-annotator agreement (IAA). However, no such measure is in widespread use for the task of syntactic annotation. This is due to a mismatch between the formulation of the agreement measures, which assumes that the annotations have no or relatively little internal structure, and syntactic annotation, where structure is the entire point of the annotation. For this reason, efforts to gauge the quality of syntactic annotation are hampered by the need to fall back on simple accuracy measures. As shown by \citeN{Art:Poe08}, such measures are biased in favour of annotation schemes with fewer categories and do not account for skewed distributions between classes, which can give high observed agreement even if the annotations are inconsistent.
In this article we propose a family of chance-corrected measures of agreement, applicable to both dependency- and constituency-based syntactic annotation, based on Krippendorff's $\alpha$ and tree edit distance. First we give an overview of traditional agreement measures and why they are insufficient for syntax, before presenting our proposed metrics. Next, we present a number of synthetic experiments performed in order to find the best distance function for this kind of annotation; finally, we contrast our new metric and simple accuracy scores as applied to real-world corpora, before concluding and presenting some potential avenues for future work.
The definitive reference for agreement measures in computational linguistics is \citeN{Art:Poe08}, who argue forcefully in favour of the use of chance-corrected measures of agreement over simple accuracy measures. However, most evaluations of syntactic treebanks use simple accuracy measures such as bracket scores for constituent trees (NEGRA, []; TIGER, []; Cat3LB, []; the Arabic Treebank, []) or labelled or unlabelled attachment scores for dependency syntax (PDT, []; PCEDT, []; Norwegian Dependency Treebank, []). The only works we know of using chance-corrected metrics are \citeN{Rag:Dic13}, who use MASI [] to measure agreement on dependency relations and head selection in multi-headed dependency syntax, and \citeN{Bha:Sha12}, who compute Cohen's $\kappa$ [] on dependency relations in single-headed dependency syntax. A limitation of the first approach is that token ID becomes the relevant category for the purposes of agreement, while the second approach only computes agreement on relations, not on structure.
In grammar-driven treebanking (or parsebanking), the problems encountered are slightly different. In HPSG and LFG treebanking, annotators do not annotate structure directly. Instead, the grammar parses the input sentences, and the annotator selects the correct parse (or rejects all the candidates) based on discriminants\footnote{A discriminant is an attribute of the analyses produced by the grammar on which some of the analyses differ, e.g. whether the word jump is a noun or a verb, or whether a PP attaches to a VP or to the VP's object NP.} of the parse forest. In this context, \citeN{deCastro11} developed a measure of agreement over discriminant selection. This differs from our approach in that agreement is computed on annotator decisions rather than on the treebanked analyses, and it is only applicable to grammar-based approaches such as HPSG and LFG treebanking.
The idea of using edit distance as the basis for an inter-annotator agreement metric has previously been explored by \citeN{Fournier13}. However, that work used a boundary edit distance as the basis of a metric for the task of text segmentation.
In this paper, we mostly follow the notation and terminology of \citeN{Art:Poe08}, with some additions. The key components in an agreement study are the items annotated, the coders who make judgements on individual items, and the annotations created for the items. We denote these as follows:
$I$: The set of items.
$C$: The set of coders.
$A$: The set of annotations, a set of sets where each set $A_i$ contains the annotations for item $i$. If not all coders annotate all items, the different $A_i$ will be of different sizes.
$K$: In the case of nominal categorisation we will also use the set of possible categories $K$.
The most common metrics used in computational linguistics are Cohen's $\kappa$ [, introduced to computational linguistics by []] and Scott's $\pi$ []. These metrics express agreement on a nominal coding task as the ratio $(A_o - A_e)/(1 - A_e)$, where $A_o$ is the observed agreement and $A_e$ the expected agreement according to some model of “random” annotation. Both metrics have essentially the same model of expected agreement:

\begin{equation}
A_e = \sum_{k \in K} P(k \mid c_1)\, P(k \mid c_2)
\end{equation}
differing only in how they estimate the probabilities: $\kappa$ assigns separate probability distributions to each coder based on their observed behaviour, while $\pi$ uses the same distribution for both coders, based on their aggregate behaviour.
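As an illustration of the difference (this sketch is ours, not code from any released implementation, and the function names are ad hoc), the two models of expected agreement for a pair of coders can be computed as follows:

\begin{verbatim}
from collections import Counter

def expected_agreement(labels1, labels2, shared_distribution):
    """A_e for two coders under the two models sketched above.

    shared_distribution=True  -> pi-style: one distribution for both coders.
    shared_distribution=False -> kappa-style: one distribution per coder.
    """
    n = len(labels1)
    categories = set(labels1) | set(labels2)
    if shared_distribution:
        pooled = Counter(labels1) + Counter(labels2)
        return sum((pooled[k] / (2 * n)) ** 2 for k in categories)
    else:
        c1, c2 = Counter(labels1), Counter(labels2)
        return sum((c1[k] / n) * (c2[k] / n) for k in categories)

def chance_corrected(labels1, labels2, shared_distribution):
    """(A_o - A_e) / (1 - A_e) for two coders' nominal annotations."""
    n = len(labels1)
    a_o = sum(a == b for a, b in zip(labels1, labels2)) / n
    a_e = expected_agreement(labels1, labels2, shared_distribution)
    return (a_o - a_e) / (1 - a_e)
\end{verbatim}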
Now, if we want to perform this same kind of evaluation on syntactic annotation, it is not possible to use $\kappa$ or $\pi$ directly. In the case of dependency-based syntax we could conceivably use a variant of these metrics by considering the ID of a token's head as a categorical variable (the approach taken in []), but we argue that this is not satisfactory. This use of the metrics would consider agreement on categories such as “tokens whose head is token number 24”, which is obviously not a linguistically informative category. Thus we have to reject this way of assessing the reliability of dependency syntax annotation. Moreover, this approach does not generalise directly to constituency-based syntax.
For dependency syntax, we could generalise these metrics in the same way that $\kappa$ is generalised to weighted $\kappa$ in order to handle partial credit for overlapping annotations. Let the function $\mathrm{agr}(t, t')$ be the number of tokens with the same head and label in the two trees $t$ and $t'$, $T_i$ the set of trees possible for an item $i$, and $\mathit{tokens}$ the number of tokens in the corpus. Then we can compute an expected agreement as follows:
\begin{equation}
A_e = \frac{1}{\mathit{tokens}} \sum_{i \in I} \sum_{t, t' \in T_i} P(t)\, P(t')\, \mathrm{agr}(t, t')
\end{equation}
We see three problems with this approach. First of all, the number of possible trees for a sentence grows exponentially with sentence length, which means that explicitly iterating over all possible pairs of trees is computationally intractable; nor have we been able to derive an efficient algorithm for this particular sum from standard algorithms.
Second, the question of which model to use for $P(t)$ is not straightforward. It is possible to use generative parsing models such as PCFGs or the generative dependency models of \citeN{Eisner96}, but agreement metrics require a model of random annotation, and using models designed for parsing thus runs the risk of over-estimating $A_e$, resulting in artificially low agreement scores.
Finally, it may be hard to establish a consensus in the field on which particular metric to use. As shown by the existence of three different metrics ($S$, $\pi$ and $\kappa$ []) for the relatively simple task of nominal coding, the choice of model for $A_e$ will not be obvious, and differing choices of generative model, as well as different choices for parameters such as smoothing, will result in subtly different agreement metrics. The results of these different metrics will not be directly comparable, which will make the results of groups using different metrics unnecessarily hard to compare.
Figure 1: (a) The original dependency tree (for the sentence “I saw the man”, with the root dominating saw as Pred, saw dominating I as Subj and man as Obj, and man dominating the as Det); (b) the tree used in comparisons, with nodes labelled by dependency relations only.
Instead, we propose to use an agreement measure based on Krippendorff's $\alpha$ [] and tree edit distance. In this approach we compare tree structures directly, which is extremely parsimonious in terms of assumptions, and furthermore sidesteps entirely the problem of probabilistically modelling annotators' behaviour. Krippendorff's $\alpha$ is not as commonly used as $\kappa$ and $\pi$, but it has the advantage of being expressed in terms of an arbitrary distance function $\delta$.
A full derivation of $\alpha$ is beyond the scope of this article, and we will simply state the formula used to compute the agreement. Krippendorff's $\alpha$ is normally expressed in terms of the ratio of observed and expected disagreements, $\alpha = 1 - D_o/D_e$, where $D_o$ is the mean squared distance between annotations of the same item and $D_e$ the mean squared distance between all pairs of annotations.
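Concretely, writing $A_i$ for the set of annotations of item $i$ and $A$ for the set of all annotations (cf.\ section 1.2), one standard formulation of these quantities is the following; the normalisation constants shown here follow the usual mean-over-pairs reading and may differ slightly from Krippendorff's original presentation:

\begin{align*}
D_o &= \frac{1}{|I|} \sum_{i \in I} \frac{1}{|A_i|\,(|A_i| - 1)} \sum_{\substack{a, a' \in A_i \\ a \neq a'}} \delta(a, a')^2\\
D_e &= \frac{1}{|A|\,(|A| - 1)} \sum_{\substack{a, a' \in A \\ a \neq a'}} \delta(a, a')^2
\end{align*}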
Note that in the expression for $D_e$ we are computing the distance between annotations of different items; thus, our distance function for syntactic trees needs to be able to compute the distance between arbitrary trees for completely unrelated sentences. The function $\delta$ can be any function as long as it is a metric; that is, it must be (1) non-negative, (2) symmetric, (3) zero only for identical inputs, and (4) it must obey the triangle inequality $\delta(x, z) \leq \delta(x, y) + \delta(y, z)$.
This immediately excludes metrics like ParsEval [] and Leaf-Ancestor [], since they assume that the trees being compared are parses of the same sentence. Instead, we base our work on tree edit distance. The tree edit distance (TED) problem is defined analogously to the more familiar problem of string edit distance: what is the minimum number of edit operations required to transform one tree into the other? See \citeN{Bille05} for a thorough introduction to the tree edit distance problem and related problems. For this work, we used the algorithm of \citeN{Zha:Sha89}. Tree edit distance has previously been used in the TedEval software [] for parser evaluation agnostic to both annotation scheme and theoretical framework, but this by itself is still an uncorrected accuracy measure and thus unsuitable for our purposes.\footnote{While it is quite different from other parser evaluation schemes, TedEval does not correct for chance agreement and is thus an uncorrected metric. It could of course form the basis for a corrected metric, given a suitable measure of expected agreement.}
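To make the computation concrete, the following sketch (ours, for illustration only; the names are ad hoc, and this is not the implementation used for the experiments) computes $\alpha$ from a list of per-item annotation sets, given any metric over trees such as a tree edit distance:

\begin{verbatim}
from itertools import combinations

def krippendorff_alpha(item_annotations, dist):
    """Chance-corrected agreement for structured annotations.

    item_annotations: a list where each element holds all annotations
        (e.g. syntactic trees) produced for one item.
    dist: a metric over annotations, e.g. a tree edit distance.
    """
    # Observed disagreement: mean squared distance within each item.
    per_item = []
    for annotations in item_annotations:
        pairs = list(combinations(annotations, 2))
        if pairs:  # items with a single annotation contribute nothing here
            per_item.append(sum(dist(a, b) ** 2 for a, b in pairs) / len(pairs))
    d_o = sum(per_item) / len(per_item)

    # Expected disagreement: mean squared distance over all pairs of
    # annotations, including pairs belonging to unrelated sentences.
    everything = [a for annotations in item_annotations for a in annotations]
    all_pairs = list(combinations(everything, 2))
    d_e = sum(dist(a, b) ** 2 for a, b in all_pairs) / len(all_pairs)

    return 1.0 - d_o / d_e
\end{verbatim}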
When comparing syntactic trees, we only want to compare dependency relations or non-terminal categories. Therefore we remove the leaf nodes in the case of phrase structure trees, and in the case of dependency trees we compare trees whose edges are unlabelled and whose nodes are labelled with the dependency relation between the corresponding word and its head; the root node receives a dedicated root label. An example of this latter transformation is shown in Figure 1.
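A sketch of this transformation for dependency trees (again ours, for illustration; the representation as parallel head and relation lists and the choice of root label are assumptions, not part of the metric itself):

\begin{verbatim}
class Node:
    """A comparison-tree node, labelled with a dependency relation."""
    def __init__(self, label):
        self.label = label
        self.children = []

def comparison_tree(heads, deprels, root_label="ROOT"):
    """Build the relabelled tree that the metric compares.

    heads[i] is the 1-based head of token i+1 (0 denotes the root) and
    deprels[i] its dependency relation. Word forms are discarded; every
    node is labelled with the relation to its head, and the artificial
    root node gets root_label.
    """
    root = Node(root_label)
    nodes = [Node(rel) for rel in deprels]
    for i, head in enumerate(heads):
        parent = root if head == 0 else nodes[head - 1]
        parent.children.append(nodes[i])
    return root

# "I saw the man": I <-Subj- saw, saw <-Pred- root,
# the <-Det- man, man <-Obj- saw (cf. Figure 1).
tree = comparison_tree([2, 0, 4, 2], ["Subj", "Pred", "Det", "Obj"])
\end{verbatim}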
Figure 2: Three comparison trees used to illustrate the distance functions: a transitive clause (root dominating Pred, which dominates Subj and Obj, with Obj dominating Det; left), the same clause with an additional modifier subtree under Obj (an Atr node dominating a further Pred, Obj and Det; centre), and an intransitive clause (root dominating Pred, which dominates only Subj; right).
We propose three different distance functions for the agreement computation: the unmodified tree edit distance (the plain distance); a second function, the edit distance minus the difference in length between the two sentences (the length-corrected distance); and finally the edit distance normalised to the range $[0, 1]$ (the normalised distance).\footnote{We can easily show that the sum of the sizes of the two trees is an upper bound on the TED, corresponding to deleting all nodes in the source tree and inserting all the nodes in the target.}
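In code, assuming the Node comparison trees sketched above and any implementation of the tree edit distance (for instance the Zhang–Shasha algorithm) passed in as ted, the three functions might look as follows; the function names are ours:

\begin{verbatim}
def tree_size(tree):
    """Number of nodes in a comparison tree (Node as defined above)."""
    return 1 + sum(tree_size(child) for child in tree.children)

def dist_plain(t1, t2, ted):
    """The unmodified tree edit distance."""
    return ted(t1, t2)

def dist_length_corrected(t1, t2, ted):
    """Edit distance minus the difference in size between the trees."""
    return ted(t1, t2) - abs(tree_size(t1) - tree_size(t2))

def dist_normalised(t1, t2, ted):
    """Edit distance normalised to [0, 1]; the sum of the two sizes is
    an upper bound (delete one tree entirely, then insert the other)."""
    return ted(t1, t2) / (tree_size(t1) + tree_size(t2))
\end{verbatim}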
The plain TED is the most parsimonious in terms of assumptions; however, it may overestimate the difference between sentences we intuitively find to be syntactically similar. For example, the only difference between the two leftmost trees in Figure 2 is a modifier, but the plain distance gives them distance 4 where the length-corrected distance gives 0. On the other hand, the length-corrected distance may underestimate some distances as well; for example, the leftmost and rightmost trees also have distance zero under it, despite our syntactic intuition that the difference between a transitive and an intransitive construction should be taken into account.
The third distance function, the normalised distance, addresses a slightly different concern: when comparing a long sentence with a short one, the distance has to be quite large simply to account for the difference in the number of nodes, unlike when comparing two short or two long sentences. Normalising to the range $[0, 1]$ puts all pairs on an equal footing.
However, we cannot say a priori which of the three functions is the optimal choice of distance function. The functions have different properties, each with its own advantages and drawbacks. We will therefore perform a number of synthetic experiments to investigate their behaviour in a controlled environment, before applying them to real-world data.
In the previous section we proposed three different agreement metrics, based on the plain, length-corrected and normalised distances, each involving different trade-offs. Deciding which of these metrics is the best one for our purpose of judging the consistency of syntactic annotation poses a bit of a conundrum. We could at this point apply our metrics to various real corpora and compare the results, but since the consistency of those corpora is unknown, it is impossible to say whether the best metric is the one yielding the highest scores, the lowest scores, or something in between. To properly settle this question, we first performed a number of synthetic experiments to gauge how the different metrics respond to disagreement.
The general approach we take is based on that of \citeN{Mathet:etal12}, adapted to dependency trees. An already annotated corpus, in our case 100 randomly selected sentences from the Norwegian Dependency Treebank [], is taken as correct and then permuted to produce “annotations” of different quality. For dependency trees, the input corpus is permuted as follows:
Each token has a probability $p_{\mathrm{label}}$ of being assigned a different label, chosen uniformly at random from the set of labels used in the corpus.
Each token has a probability $p_{\mathrm{head}}$ of being assigned a new head, chosen uniformly at random from the set of tokens not dominated by the token.
The second permutation process depends on the order in which the tokens are processed, and we consider the tokens in the post-order\footnote{That is, the child nodes of a node are all processed before the node itself. Nodes on the same level are traversed from left to right.} dictated by the original tree. This way, tokens close to the root have a fair chance of having candidate heads if they are selected. A pre-order traversal would leave tokens close to the root with few options; in particular, if the root has a single child, that node has no possible new heads unless one of its own children has first been assigned the root as its new head. For example, in the trees in Figure 2, assigning any head other than the root to the Pred nodes directly dominated by the root would result in invalid (cyclic or unconnected) dependency trees. Traversing the tokens in the linear order of the sentence has similar issues for tokens close to the root and close to the start of the sentence.
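A sketch of the perturbation procedure on the (heads, deprels) representation used earlier (for illustration; the parameter names $p_{\mathrm{label}}$ and $p_{\mathrm{head}}$, and details such as whether the current head remains a candidate, are our assumptions):

\begin{verbatim}
import random

def post_order(heads):
    """Tokens (1-based) of the original tree in post-order."""
    children = {t: [] for t in range(len(heads) + 1)}  # 0 is the root
    for t, head in enumerate(heads, start=1):
        children[head].append(t)
    order = []
    def visit(node):
        for child in children[node]:
            visit(child)
        if node != 0:
            order.append(node)
    visit(0)
    return order

def dominated(token, heads):
    """Tokens dominated by `token` in the current (partially permuted) tree."""
    result, frontier = set(), [token]
    while frontier:
        current = frontier.pop()
        for t, head in enumerate(heads, start=1):
            if head == current and t not in result:
                result.add(t)
                frontier.append(t)
    return result

def perturb(heads, deprels, labels, p_label, p_head):
    """Randomly degrade one dependency tree, as described above."""
    heads, deprels = list(heads), list(deprels)
    for token in post_order(heads):       # order fixed by the original tree
        if random.random() < p_label:
            deprels[token - 1] = random.choice(
                [l for l in labels if l != deprels[token - 1]])
        if random.random() < p_head:
            blocked = dominated(token, heads) | {token}
            candidates = [t for t in range(1, len(heads) + 1)
                          if t not in blocked] + [0]  # the root is allowed
            heads[token - 1] = random.choice(candidates)
    return heads, deprels
\end{verbatim}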
For our first set of experiments, we set $p_{\mathrm{label}} = p_{\mathrm{head}}$ and evaluated the different agreement metrics for 10 evenly spaced probability values between 0.1 and 1.0. Initial exploration of the data showed that the mean follows the median very closely regardless of metric and perturbation level, and we therefore only report mean scores across runs in this paper. The results of these experiments are shown in Figure 3, with the labelled attachment score\footnote{The de facto standard parser evaluation metric in dependency parsing: the percentage of tokens that receive the correct head and dependency relation.} (LAS) for comparison.
The length-corrected metric is clearly extremely sensitive to noise, with even low perturbation levels yielding very low mean scores, while the normalised metric is more lenient than both LAS and the plain metric, its mean score remaining comparatively high even at the highest perturbation level. To further study the sensitivity of the metrics to the two kinds of noise, we performed an additional set of experiments, setting one of the two probabilities to zero while varying the other over the same range as in the previous experiment; the results are shown in Figures 4 and 5.
The LAS curves are mostly unremarkable, with one exception: mean LAS at the highest perturbation level in Figure 5 is 23.9%, clearly much higher than we would expect if the trees were completely random. In comparison, mean LAS when only labels are perturbed is 4.1%, and since the sample space of possible trees is clearly much larger than that of relabellings, a uniform random selection of trees would yield a LAS much closer to 0. This shows that our tree shuffling algorithm has a non-uniform distribution over the sample space.
While the behaviour of our $\alpha$ metrics and LAS is relatively similar in Figure 3, Figures 4 and 5 show that they do in fact have important differences. Whereas LAS responds linearly to perturbation of both labels and structure, with its parabolic behaviour in Figure 3 being simply the product of these two linear responses, the $\alpha$ metrics respond differently to structural noise and label noise, with label disagreements being penalised less harshly than structural disagreements.
The strictness of the length-corrected metric and the laxity of the normalised one are due to the effects the modified distance functions have on the distribution of distances. Subtracting the length difference causes an extreme shift of the distances towards 0; more than 30% of the sentence pairs have distance 0, 1, or 2, which makes $D_e$ extremely low and thus gives disproportionately large weight to the non-zero distances contributing to $D_o$. Normalisation, on the other hand, causes a rightward shift of the distances, which results in a high $D_e$ and thus in individual disagreements carrying less weight.
Synthetic experiments do not always fully reflect real-world behaviour, however. Therefore we also evaluate our metrics on real-world inter-annotator agreement data sets. In this evaluation we contrast labelled accuracy, the standard parser evaluation metric, with our three $\alpha$ metrics. In particular, we are interested in the correlation (or lack thereof) between LAS and the alphas, and in whether the results of our synthetic experiments carry over to real-world IAA data. Finally, we also evaluate the $\alpha$ metric on both dependency and phrase structure data.
We obtained\footnote{We contacted a number of treebank projects, among them the Penn Treebank and the Prague Dependency Treebank, but not all of them had data available.} data from four different corpora. Three of the data sets are dependency treebanks (NDT, CDT, PCEDT) and one is a phrase structure treebank (SSD); of the dependency treebanks, the PCEDT contains semantic dependencies, while the other two have traditional syntactic dependencies. The number of annotators and the sizes of the different data sets are summarised in Table~\ref{tbl:corpora}.
The Norwegian Dependency Treebank [] is a dependency treebank constructed at the National Library of Norway. The data studied in this work has previously been used by \citeN{Skjaerholt13} to study agreement, but using simple accuracy measures (UAS, LAS) rather than chance-corrected measures. The IAA data set is divided into three parts, corresponding to different parsers used to preprocess the data before annotation; what we term NDT 1 through 3 correspond to what \citeN{Skjaerholt13} labels Danish, Swedish and Norwegian, respectively.
The Copenhagen Dependency Treebanks [] are a collection of parallel dependency treebanks, containing data from the Danish PAROLE corpus [] in the original Danish and translated into English, Italian and Spanish.
\ctable[botcap, caption={Sizes of the different IAA corpora}, label=tbl:corpora]{lcc}{
\tnote[a]{2 annotators}
\tnote[b]{4 annotators, avg. 2.8 annotators/text (min. 2, max. 4)}
\tnote[c]{3 annotators, avg. 2.7 annotators/text}
\tnote[d]{11 annotators, avg. 2.5 annotators/text (min. 2, max. 6)}
\tnote[e]{3 annotators, avg. 2.9 annotators/sent.}
}{
\FL Corpus        & Sentences & Tokens \ML
NDT 1\tmark[a]    & 130       & 1674   \NN
NDT 2\tmark[a]    & 110       & 1594   \NN
NDT 3\tmark[a]    & 150       & 1997   \ML
CDT (da)\tmark[a] & 162       & 2394   \NN
CDT (en)\tmark[a] & 264       & 5528   \NN
CDT (es)\tmark[b] & 55        & 924    \NN
CDT (it)\tmark[c] & 136       & 3057   \ML
PCEDT\tmark[d]    & 3531      & 61737  \ML
SSD\tmark[e]      & 96        & 1581   \LL
}
The Prague Czech-English Dependency Treebank 2.0 \citeN{PCEDT2} is a parallel corpus of English and Czech, consisting of English data from the Wall Street Journal section of the Penn Treebank [] and Czech translations of the English data. The syntactic annotations are layered and consist of an analytical layer similar to the annotations in most other dependency treebanks, and a more semantic tectogrammatical layer.
Our data set consists of a common set of analytical annotations shared by all the annotators, and the tectogrammatical analyses built on top of this common foundation. A distinguishing feature of the tectogrammatical analyses, vis-à-vis the other treebanks we are using, is that semantically empty words take part only in the analytical annotation layer, while nodes are inserted at the tectogrammatical layer to represent covert elements of the sentence not present in the surface syntax of the analytical layer. Thus, inserting and deleting nodes is a central part of the task of tectogrammatical annotation, unlike the more surface-oriented annotation of our other treebanks, where the tokenisation is fixed before the text is annotated.
The Star-Sem Data (SSD) is a portion of the data set released for the *SEM 2012 shared task [], parsed using the LinGO English Resource Grammar (ERG, []) with the resulting parse forest disambiguated based on discriminants. The ERG is an HPSG-based grammar, and as such its analyses are attribute-value matrices (AVMs); an AVM is not a tree but a directed acyclic graph, however, and for this reason we compute agreement not on the AVMs but on the so-called derivation trees. A derivation tree records the types of the lexical items in the sentence and the bottom-up order of rule applications used to produce the final analysis, and can be handled by our procedure like any phrase structure tree.
To evaluate our corpora, we compute the three $\alpha$ variants described in the previous two sections and compare them with labelled accuracy scores.
When there are more than two annotators, we generalise the LAS metric to the average pairwise LAS for each sentence, weighted by the length of the sentence. Let $\mathrm{acc}(t, t')$ be the fraction of tokens with identical head and label in the trees $t$ and $t'$; the pairwise labelled accuracy of a set of annotations $A$ as described in section 1.2 is then:
\begin{equation}
\mathrm{LAS}(A) = \frac{\displaystyle \sum_{i \in I} |s_i| \cdot \frac{2}{|A_i|\,(|A_i| - 1)} \sum_{\{t, t'\} \subseteq A_i} \mathrm{acc}(t, t')}{\displaystyle \sum_{i \in I} |s_i|}
\end{equation}

where $|s_i|$ is the number of tokens in sentence $i$.
This is equivalent to the traditional metric in the case where there are only two annotators.
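A sketch of this generalised score (illustrative only; it assumes each sentence has at least two annotators and that all annotations of a sentence share the same tokens):

\begin{verbatim}
from itertools import combinations

def pairwise_las(items):
    """Length-weighted average pairwise LAS over a corpus.

    items: a list of sentences; each sentence is a list of
    (heads, deprels) pairs, one per annotator, over the same tokens.
    """
    weighted_sum, total_length = 0.0, 0
    for annotations in items:
        length = len(annotations[0][0])              # number of tokens
        pairs = list(combinations(annotations, 2))
        agree = 0.0
        for (h1, d1), (h2, d2) in pairs:
            agree += sum(a == b and x == y
                         for a, b, x, y in zip(h1, h2, d1, d2)) / length
        weighted_sum += length * agree / len(pairs)  # mean pairwise LAS
        total_length += length
    return weighted_sum / total_length
\end{verbatim}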
\ctable[botcap, caption={Agreement scores on real-world corpora}, label=tbl:alpha-real]{lcccc}{
\tnote[a]{2 sentences ignored}
\tnote[b]{15 sentences ignored}
\tnote[c]{1178 sentences ignored}
\tnote[d]{Mean pairwise Jaccard similarity}
}{
\FL Corpus & $\alpha$ & $\alpha$ & $\alpha$ & LAS \ML
NDT 1    & 98.4 & 93.0 & 98.8 & 94.0 \NN
NDT 2    & 98.9 & 95.0 & 99.1 & 94.4 \NN
NDT 3    & 97.9 & 91.2 & 98.7 & 95.3 \ML
CDT (da) & 95.7 & 84.7 & 96.2 & 90.4 \NN
CDT (en) & 92.4 & 70.7 & 95.0 & 88.4 \NN
CDT (es) & 86.6 & 48.8 & 85.8 & 78.9\tmark[a] \NN
CDT (it) & 84.5 & 55.7 & 89.2 & 81.3\tmark[b] \ML
PCEDT    & 95.9 & 89.9 & 96.5 & 68.0\tmark[c] \ML
SSD      & 99.1 & 98.6 & 99.3 & 87.9\tmark[d] \LL
}
As our uncorrected metric for comparing two phrase structure trees we do not use the traditional bracket score, as it does not generalise well to more than two annotators, but rather the Jaccard similarity. The Jaccard similarity of two sets $X$ and $Y$ is the ratio of the size of their intersection to the size of their union, $J(X, Y) = |X \cap Y| / |X \cup Y|$, and we use the Jaccard similarity of the sets of labelled bracketings of two trees as our uncorrected measure. To compute the similarity for a complete set of annotations we use the mean pairwise Jaccard similarity weighted by sentence length; that is, the same procedure as in equation (3), but using Jaccard similarity rather than LAS.
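For concreteness, labelled bracketings and their Jaccard similarity can be computed along the following lines (our sketch; the (label, children) tree representation is an assumption):

\begin{verbatim}
def labelled_bracketings(tree, start=0):
    """Return (spans, end): the set of (label, start, end) bracketings
    of `tree`, a (label, children) pair whose leaves are token strings."""
    label, children = tree
    spans = set()
    position = start
    for child in children:
        if isinstance(child, str):      # a terminal advances one token
            position += 1
        else:
            child_spans, position = labelled_bracketings(child, position)
            spans |= child_spans
    spans.add((label, start, position))
    return spans, position

def bracket_jaccard(tree1, tree2):
    """Jaccard similarity of the two trees' labelled bracketings."""
    b1, _ = labelled_bracketings(tree1)
    b2, _ = labelled_bracketings(tree2)
    return len(b1 & b2) / len(b1 | b2)
\end{verbatim}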
Since LAS assumes that the two sentences compared have identical sets of tokens, we had to exclude a number of sentences from the LAS computation in the case of the English and Italian CDT corpora, and especially the PCEDT. The large number of sentences excluded from the PCEDT is due to the fact that in its tectogrammatical analyses, inserting and deleting nodes is an important part of the annotation task.
Looking at the results in Table~\ref{tbl:alpha-real}, we observe two things. Most obvious is the extremely large gap between LAS and the $\alpha$ metrics on the PCEDT data. There is, however, a more subtle point: the orderings of the corpora by the different metrics are not the same. LAS orders the corpora NDT 3, 2, 1, CDT da, en, it, es, PCEDT, whereas two of the $\alpha$ metrics give the order NDT 2, 1, 3, PCEDT, CDT da, en, it, es, and the third gives the same order but with CDT es and it changing places. Furthermore, as the scatterplot in Figure 6 shows, there is a clear correlation between the $\alpha$ metrics and LAS if we disregard the PCEDT results.
The reason the PCEDT receives such a low LAS is essentially the same as the reason many sentences had to be excluded from the computation in the first place: since inserting and deleting nodes is an integral part of the tectogrammatical annotation task, the assumption implicit in the LAS computation, that sentences with the same number of nodes have the same nodes in the same order, is obviously false, resulting in a very low LAS.
The corpus that scores highest on all three $\alpha$ metrics is the SSD corpus. The reason for this is uncertain, as our corpora differ along many dimensions, but the fact that the annotation was done by professional linguists who are very familiar with the grammar used to parse the data is likely a contributing factor. The difference between the $\alpha$ metrics and the Jaccard similarity is larger here than the difference between the $\alpha$ metrics and LAS for our dependency corpora; however, the two similarity metrics are not directly comparable, and it is well known that for phrase structure trees a single disagreement, such as a PP-attachment disagreement, can result in multiple disagreeing bracketings.
The most important conclusion we draw from this work concerns the most appropriate agreement metric for syntactic annotation. First of all, we reject LAS, primarily due to the methodological inadequacy of using an uncorrected measure. While our experiments did not reveal any serious shortcomings (unlike those of [], who showed in the case of categorisation that an uncorrected measure can increase where a chance-corrected one does not), the methodological problems of uncorrected metrics make us wary of LAS as an agreement metric. Next, of the three $\alpha$ metrics, the one based on the plain tree edit distance is clearly the best: the length-corrected metric is extremely sensitive to even moderate amounts of disagreement, while the normalised metric is overly lenient.
Looking solely at Figure 3, one might be led to believe that LAS and $\alpha$ are interchangeable, but this is not the case. As shown by Figures 4 and 5, the parabolic shape of the LAS curve in Figure 3 is simply the combination of the metric's linear responses to both label and structural perturbations. The behaviour of $\alpha$, on the other hand, is more complex, with structural noise penalised more heavily than perturbations of the labels. Thus, the similarity of LAS and $\alpha$ is not at all assured when the amounts of structural and labelling disagreement differ. Moreover, we consider this unequal weighting of structural and labelling disagreements a benefit, as structure is the larger part of syntactic annotation compared to the labelling of the dependencies or bracketings. Finally, our experiments show that $\alpha$ is a single metric that is applicable to both dependencies and phrase structure trees.
Furthermore, $\alpha$-based metrics are far more flexible than simple accuracy metrics. The use of a distance function to define the metric means that more fine-grained distinctions can be made; for example, if the set of labels on the structures is itself highly structured, partial credit can be given for differing annotations that overlap. For instance, if different types of adverbials (temporal, negation, etc.) receive different relations, as is the case in the Swedish Talbanken05 [] corpus, confusion of different adverbial types can be given less weight than confusion between subject and object. The $\alpha$-based metrics are also far easier to apply to a more complex annotation task such as the tectogrammatical annotation of the PCEDT. In this task inserting and deleting nodes is an integral part of the annotation, and if two annotators insert or delete different nodes, the all-or-nothing requirement of identical yield makes LAS impossible to use as an evaluation metric in this setting.
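As a sketch of how such partial credit could be wired into the distance function (the label names here are hypothetical and not taken from any of the treebanks discussed), a label-aware substitution cost for the tree edit distance might look like this:

\begin{verbatim}
def label_substitution_cost(label1, label2):
    """Illustrative substitution cost for the tree edit distance:
    free for identical labels, reduced for confusions within a group
    of related relations, full cost otherwise."""
    adverbial = {"ADV-TMP", "ADV-NEG", "ADV-LOC"}   # hypothetical label set
    if label1 == label2:
        return 0.0
    if label1 in adverbial and label2 in adverbial:
        return 0.5                                   # partial credit
    return 1.0
\end{verbatim}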
In future work, we would like to investigate the use of other distance functions, in particular approximate tree edit distances such as the $pq$-gram algorithm []. For large data sets such as the PCEDT set used in this work, computing $\alpha$ with tree edit distance as the distance measure can take a very long time.\footnote{The Python implementation used in this work, using NumPy and the PyPy compiler, took seven and a half hours to compute a single $\alpha$ for the PCEDT data set on an Intel Core i7 2.9 GHz computer. The program is single-threaded.} This is because computing $D_e$ requires a number of pairwise comparisons quadratic in the size of the data set, each of which is itself a relatively expensive tree edit distance computation with our current approach. The problem of directed graph edit distance is NP-hard, which means that approximate algorithms are a requirement if our method is to be applied to HPSG analyses directly.
Another avenue for future work is improved synthetic experiments. As we saw, our implementation of tree perturbations was biased towards trees similar in shape to the source tree, and an improved permutation algorithm may reveal interesting edge-case behaviour in the metrics. A method for perturbing phrase structure trees would also be interesting, as this would allow us to repeat the synthetic experiments performed here using phrase structure corpora to compare the behaviour of the metrics on the two types of corpus.
Finally, annotator modelling techniques like the one presented in \citeN{Pas:Car13} have obvious advantages over agreement coefficients such as $\alpha$. These techniques are more easily interpreted than agreement coefficients, and they allow us to assess the quality of individual annotators, a crucial property in crowd-sourcing settings and something that is impossible using agreement coefficients.
I would like to thank Jan Štěpánek at Charles University for data from the PCEDT and help with the conversion process, the CDT project for publishing their agreement data, Per Erik Solberg at the Norwegian National Library for data from the NDT, and Emily Bender at the University of Washington for the SSD data.