Extracting person names from diverse and noisy OCR text

  • Authors:
  • Thomas L. Packer; Joshua F. Lutes; Aaron P. Stewart; David W. Embley; Eric K. Ringger; Kevin D. Seppi; Lee S. Jensen

  • Affiliations:
  • Brigham Young University, Provo, UT, USA; Brigham Young University, Provo, UT, USA; Brigham Young University, Provo, UT, USA; Brigham Young University, Provo, UT, USA; Brigham Young University, Provo, UT, USA; Brigham Young University, Provo, UT, USA; Ancestry.com, Inc., Provo, UT, USA

  • Venue:
  • AND '10: Proceedings of the fourth workshop on Analytics for noisy unstructured text data
  • Year:
  • 2010

Abstract

Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and the absence of page layout information. How difficult is it, and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from the OCR output, and what a human can manually extract from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
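
The majority-vote ensemble mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say at what granularity the votes are taken, so this example assumes token-level voting over binary name/not-name labels; the function name `majority_vote`, the sample tokens, and the three extractor outputs are all hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[bool]]) -> List[bool]:
    """Combine per-token name/not-name labels from several extractors.

    A token is labeled as part of a name when more than half of the
    extractors label it so. (Assumed voting rule, not the paper's.)
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all extractors must label the same tokens")
    threshold = len(predictions) / 2
    return [
        sum(p[i] for p in predictions) > threshold
        for i in range(n_tokens)
    ]


# Hypothetical outputs of three extractors on noisy OCR tokens.
tokens = ["Mr.", "Jno.", "Srnith", "of", "Provo"]
extractor_a = [False, True, True, False, False]
extractor_b = [False, True, False, False, True]
extractor_c = [True, True, True, False, False]

combined = majority_vote([extractor_a, extractor_b, extractor_c])
names = [t for t, keep in zip(tokens, combined) if keep]
```

With three voters, a token is kept only when at least two extractors agree, so an isolated false positive from a single extractor (such as "Provo" above) is dropped while a token missed by only one extractor (such as the OCR-damaged "Srnith") is retained.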