Gold standard mappings created by experts are at the core of alignment evaluation. At the same time, the process of manual evaluation is rarely discussed. While the practice of having multiple raters evaluate results is accepted, their level of agreement is often not measured. In this paper we describe three experiments in manual evaluation and study the way different raters evaluate mappings. We used alignments generated with different techniques and between vocabularies of different types. In each experiment, five raters evaluated alignments and talked through their decisions using the think-aloud method. In all three experiments we found that inter-rater agreement was low, and we analyzed our data to find the reasons for it. Our analysis shows which variables can be controlled to affect the level of agreement, including the mapping relations, the evaluation guidelines and the background of the raters. On the other hand, differences in the raters' perceptions and the complexity of the relations between often ill-defined natural language concepts remain inherent sources of disagreement. Our results indicate that the manual evaluation of ontology alignments is by no means an easy task and that the ontology alignment community should be careful in the construction and use of reference alignments.
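
To illustrate how the level of agreement among several raters can be quantified, the following is a minimal sketch of Fleiss' kappa in Python. It is not the paper's own evaluation code; the category names ("correct", "incorrect", "unsure") and the example counts are hypothetical and chosen only to show the shape of the input: one row of per-category rater counts for each evaluated mapping.

# Minimal sketch: Fleiss' kappa for agreement among multiple raters.
# Each row of `ratings` counts how many raters assigned one mapping to each
# category (e.g. "correct", "incorrect", "unsure"); names are illustrative.

def fleiss_kappa(ratings):
    """ratings: list of per-mapping category counts; every row must sum to
    the same number of raters."""
    N = len(ratings)           # number of rated mappings
    n = sum(ratings[0])        # raters per mapping
    k = len(ratings[0])        # number of categories

    # Observed agreement: average pairwise agreement per mapping.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N

    # Chance agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)


# Hypothetical example: 4 mappings judged by 5 raters each.
counts = [
    [5, 0, 0],
    [3, 2, 0],
    [1, 3, 1],
    [2, 2, 1],
]
print(round(fleiss_kappa(counts), 3))  # a value near 0 indicates low agreement

Values of kappa near 1 indicate strong agreement beyond chance, values near 0 indicate agreement no better than chance; low values such as those reported in the experiments above signal that raters interpret the mappings differently.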