Ranked multidimensional dialogue act annotation
ESSLLI'10 Proceedings of the 2010 international conference on New Directions in Logic, Language and Computation
We present a first analysis of inter-annotator agreement for the DIT++ tagset of dialogue acts, a comprehensive, layered, multidimensional set of 86 tags in which subsets of tags within a dimension or layer are often hierarchically organised. We argue that for such highly structured annotation schemes in particular, the well-known kappa statistic is not an adequate measure of inter-annotator agreement. Instead, we propose a statistic that takes the structural properties of the tagset into account, and we discuss its application in an annotation experiment. The experiment shows promising agreement scores for most dimensions in the tagset and provides useful insights into the usability of the annotation scheme, but it also indicates that several additional factors influence annotator agreement. Finally, we suggest that the proposed per-dimension agreement measure offers a good basis for measuring annotator agreement across all dimensions of a multidimensional annotation scheme.
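The abstract does not spell out the proposed statistic, but its core idea, weighting disagreements by how far apart two tags sit in the tag hierarchy, can be sketched in a few lines. The Python sketch below is an illustration under invented assumptions (a toy four-tag hierarchy and a simple normalised tree distance), not the authors' actual measure and not the DIT++ taxonomy:

from itertools import product

# Toy tag hierarchy (invented for illustration; DIT++ defines its own):
# each tag maps to its parent, None marks the root.
PARENT = {
    "info-seeking": None,
    "propositional-q": "info-seeking",
    "check-q": "propositional-q",
    "set-q": "info-seeking",
}

def ancestors(tag):
    """Path from a tag up to the root, including the tag itself."""
    path = []
    while tag is not None:
        path.append(tag)
        tag = PARENT[tag]
    return path

def tag_distance(a, b):
    """Normalised tree distance in [0, 1]: 0 for identical tags, small for
    tags sharing a deep common ancestor, larger across branches."""
    if a == b:
        return 0.0
    pa, pb = ancestors(a), ancestors(b)
    shared = len(set(pa) & set(pb))  # common ancestors, incl. the lowest one
    steps = (len(pa) - shared) + (len(pb) - shared)  # hops up to that ancestor
    return steps / (len(pa) + len(pb))

def weighted_kappa(pairs, tags):
    """Kappa-style agreement, 1 - D_obs / D_exp, where the disagreement
    between two tags is their tree distance rather than all-or-nothing."""
    n = len(pairs)
    d_obs = sum(tag_distance(a, b) for a, b in pairs) / n
    # Expected disagreement from the two coders' marginal tag distributions.
    p1 = {t: sum(a == t for a, _ in pairs) / n for t in tags}
    p2 = {t: sum(b == t for _, b in pairs) / n for t in tags}
    d_exp = sum(p1[s] * p2[t] * tag_distance(s, t)
                for s, t in product(tags, tags))
    return 1.0 - d_obs / d_exp

# Two hypothetical coders labelling three utterances:
pairs = [("check-q", "propositional-q"), ("set-q", "set-q"),
         ("check-q", "check-q")]
print(weighted_kappa(pairs, list(PARENT)))  # ~0.78; plain kappa gives 0.5 here

Under this weighting, an exact match contributes no disagreement, confusing a tag with its parent contributes little, and confusing tags from different branches contributes most; a plain kappa scores all mismatches identically, which is the inadequacy for structured tagsets that the abstract points to.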