Human annotation for Co-reference Resolution (CRR) is labor-intensive and costly, and only a handful of annotated corpora are currently available. Corpora with Named Entity (NE) annotations, however, are widely available. Moreover, unlike current CRR systems, state-of-the-art NER systems achieve very high accuracy and can generate NE labels that closely approximate the gold standard on unlabeled corpora. We propose a new set of metrics, collectively called CONE, for Named Entity Co-reference Resolution (NE-CRR). CONE metrics use a subset of the gold standard annotations, with the advantage that this subset can be easily approximated from NE labels when gold-standard CRR annotations are absent. We define CONE B3 and CONE CEAF metrics based on the traditional B3 and CEAF metrics, and show that a CRR system's CONE B3 and CONE CEAF scores on any dataset are highly correlated with its B3 and CEAF scores, respectively: correlation coefficients exceed 0.6 for all CRR systems across all datasets, with a best case of 0.8. We also present a baseline method for estimating the gold standard required by the CONE metrics, and show that CONE B3 and CONE CEAF scores computed with this estimated gold standard remain correlated with B3 and CEAF scores. We thus demonstrate the suitability of CONE B3 and CONE CEAF for automatic evaluation of NE-CRR.
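For context, the traditional B3 metric that CONE B3 builds on scores each mention by the overlap between its gold ("key") cluster and system ("response") cluster, then averages over mentions. Below is a minimal sketch of standard B3 (not the CONE variant); the function name and the list-of-sets input format are illustrative choices, not from the paper:

```python
def b_cubed(key_clusters, response_clusters):
    """B3 precision, recall, and F1.

    Both arguments are clusterings given as lists of sets of
    mention identifiers. Only mentions present in both clusterings
    are scored (a simplifying assumption for this sketch).
    """
    # Map each mention to the cluster (set) that contains it.
    key_of = {m: c for c in key_clusters for m in c}
    resp_of = {m: c for c in response_clusters for m in c}
    mentions = [m for m in key_of if m in resp_of]

    # Per-mention precision: overlap / size of the response cluster.
    p = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m])
            for m in mentions) / len(mentions)
    # Per-mention recall: overlap / size of the key cluster.
    r = sum(len(key_of[m] & resp_of[m]) / len(key_of[m])
            for m in mentions) / len(mentions)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

CONE B3, as described in the abstract, applies this computation to a subset of the gold annotations (approximable from NE labels); CEAF differs in that it first finds an optimal one-to-one alignment between key and response clusters.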