Confidence estimation for information extraction

  • Authors:
  • Aron Culotta;Andrew McCallum

  • Affiliations:
  • University of Massachusetts, Amherst, MA;University of Massachusetts, Amherst, MA

  • Venue:
  • HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the confidence of both extracted fields and entire multi-field records, obtaining an average precision of 98% for retrieving correct fields and 87% for multi-field records.