Gold standard mappings created by experts are at the core of alignment evaluation. At the same time, the process of manual evaluation is rarely discussed. While the practice of having multiple raters evaluate results is accepted, their level of agreement is often not measured. In this paper we describe three experiments in manual evaluation and study the way different raters evaluate mappings. We used alignments generated with different techniques and between vocabularies of different types. In each experiment, five raters evaluated alignments and talked through their decisions using the think-aloud method. In all three experiments we found that inter-rater agreement was low, and we analyzed our data to find the reasons for it. Our analysis shows which variables can be controlled to affect the level of agreement, including the mapping relations, the evaluation guidelines and the background of the raters. On the other hand, differences in the raters' perceptions and the complexity of the relations between often ill-defined natural language concepts remain inherent sources of disagreement. Our results indicate that the manual evaluation of ontology alignments is by no means an easy task and that the ontology alignment community should be careful in the construction and use of reference alignments.
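
To illustrate how the level of agreement among several raters can be quantified, the following is a minimal sketch of Fleiss' kappa in Python. It is not the paper's own evaluation code; the category names ("correct", "incorrect", "unsure") and the example counts are hypothetical and chosen only to show the shape of the input: one row of per-category rater counts for each evaluated mapping.

# Minimal sketch: Fleiss' kappa for agreement among multiple raters.
# Each row of `ratings` counts how many raters assigned one mapping to each
# category (e.g. "correct", "incorrect", "unsure"); names are illustrative.

def fleiss_kappa(ratings):
    """ratings: list of per-mapping category counts; every row must sum to
    the same number of raters."""
    N = len(ratings)           # number of rated mappings
    n = sum(ratings[0])        # raters per mapping
    k = len(ratings[0])        # number of categories

    # Observed agreement: average pairwise agreement per mapping.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N

    # Chance agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)


# Hypothetical example: 4 mappings judged by 5 raters each.
counts = [
    [5, 0, 0],
    [3, 2, 0],
    [1, 3, 1],
    [2, 2, 1],
]
print(round(fleiss_kappa(counts), 3))  # a value near 0 indicates low agreement

Values of kappa near 1 indicate strong agreement beyond chance, values near 0 indicate agreement no better than chance; low values such as those reported in the experiments above signal that raters interpret the mappings differently.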