Cost effective ontology population with data from lists in OCRed historical documents

  • Authors:
  • Thomas L. Packer; David W. Embley

  • Affiliations:
  • Brigham Young University, Provo, Utah; Brigham Young University, Provo, Utah

  • Venue:
  • Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing
  • Year:
  • 2013

Abstract

A method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would contribute to making a variety of historical knowledge machine searchable, queryable, and linkable. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selection of human guidance. We propose ListReader, a wrapper-induction solution for information extraction that is specialized for lists in OCRed documents. ListReader can induce either a regular-expression grammar or a Hidden Markov Model. Each can infer list structure and field labels from OCR text. We decrease the cost and improve the accuracy of the induction process using semi-supervised machine learning and active learning, allowing induction of a wrapper from almost a single hand-labeled instance per field per list. After applying an induced wrapper, ListReader automatically maps the labeled text it produces to a rich variety of ontologically structured predicates. We evaluate our implementation on family history books in terms of the typical F-measure and a new metric, "Label Efficiency", which measures both extraction quality and cost in a single number. We show with statistical significance that ListReader reaches values closer to optimal levels than a state-of-the-art statistical sequence labeler.
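
To make the regular-expression wrapper idea concrete, the following is a minimal sketch of how such a wrapper might be applied to OCRed list entries and how its labeled fields might be turned into predicates. It is not ListReader's implementation: the pattern is written by hand rather than induced, and the list format, the field names (`index`, `given_name`, `surname`, `birth_year`), the predicate names, and the sample lines are all assumptions made for illustration.

```python
import re

# Hypothetical wrapper for entries shaped like "1. Mary Ely, b. 1836".
# Written by hand here; in the paper's setting a pattern of this kind
# would be induced from a few hand-labeled examples.
ENTRY_PATTERN = re.compile(
    r"^\s*(?P<index>\d+)[.,]?\s+"        # entry number; OCR may turn '.' into ','
    r"(?P<given_name>[A-Z][a-z]+)\s+"    # assumed field: given name
    r"(?P<surname>[A-Z][a-z]+),?\s+"     # assumed field: surname
    r"b[.,]?\s*"                         # birth marker "b."
    r"(?P<birth_year>[0-9Il]{4})"        # year; tolerate '1' read as 'l' or 'I'
)


def normalize_year(raw):
    """Undo the common OCR confusion of the digit 1 with 'l' or 'I'."""
    return int(raw.replace("l", "1").replace("I", "1"))


def extract(line):
    """Return a dict of labeled fields, or None if the line is not an entry."""
    match = ENTRY_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["birth_year"] = normalize_year(fields["birth_year"])
    return fields


def to_predicates(fields):
    """Map labeled fields to simple predicate tuples (assumed predicate names)."""
    person = "person" + fields["index"]
    return [
        ("Person", person),
        ("has_given_name", person, fields["given_name"]),
        ("has_surname", person, fields["surname"]),
        ("born_in_year", person, fields["birth_year"]),
    ]


if __name__ == "__main__":
    sample_lines = [
        "1. Mary Ely, b. 1836",
        "2, Gerard Lathrop, b, l791",  # simulated OCR errors
        "Chapter IV",                   # not a list entry
    ]
    for line in sample_lines:
        fields = extract(line)
        print(line, "->", to_predicates(fields) if fields else "no match")
```

The character classes that accept either `.` or `,` and either `1`, `l`, or `I` show one simple way a pattern can tolerate OCR noise; lines that do not fit the list format are skipped rather than forced into a record.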