Recognizing records from the extracted cells of microfilm tables

  • Authors:
  • Kenneth M. Tubbs; David W. Embley

  • Affiliations:
  • Microsoft Corporation, Redmond, WA; Brigham Young University, Provo, UT

  • Venue:
  • Proceedings of the 2002 ACM symposium on Document engineering
  • Year:
  • 2002

Abstract

Microfilm documents contain a wealth of information, but extracting and organizing this information by hand is slow, error-prone, and tedious. As an initial step toward automating access to this information, we describe in this paper an algorithmic process to automatically identify record patterns found in microfilm tables for pre-specified application domains. Our table-processing algorithm accepts an XML input file describing the individual cells of a table taken from a microfilm document, and finds for each record in the document the cells that together comprise the record. Two key features drive the algorithm: (1) geometric layout and (2) label matching with respect to a given domain-specific application ontology. The algorithm achieved an accuracy of 92% on our test corpus of genealogical microfilm tables.
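
To make the two features named in the abstract concrete, the following is a minimal Python sketch and not the authors' algorithm. It assumes a hypothetical XML schema (`<cell>` elements with `x`, `y`, `w`, `h` attributes) and a toy stand-in for the application ontology (`ONTOLOGY`); neither is taken from the paper. Rows are formed from vertically overlapping cells (geometric layout), the first row whose cells all match known labels is treated as the header (label matching), and each later row is paired with the header by column position, which is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical input schema; the paper's actual XML format is not shown here.
SAMPLE_XML = """
<table>
  <cell x="0" y="0" w="100" h="20">Name</cell>
  <cell x="100" y="0" w="60" h="20">Age</cell>
  <cell x="0" y="20" w="100" h="20">John Smith</cell>
  <cell x="100" y="20" w="60" h="20">34</cell>
  <cell x="0" y="40" w="100" h="20">Mary Jones</cell>
  <cell x="100" y="40" w="60" h="20">29</cell>
</table>
"""

# Hypothetical stand-in for a domain ontology: concept -> accepted label spellings.
ONTOLOGY = {"name": {"name", "names"}, "age": {"age", "years"}}


def parse_cells(xml_text):
    """Read each <cell> element into a dict with its bounding box and text."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "x": float(el.get("x")), "y": float(el.get("y")),
            "w": float(el.get("w")), "h": float(el.get("h")),
            "text": (el.text or "").strip(),
        }
        for el in root.iter("cell")
    ]


def group_rows(cells):
    """Geometric layout: cells whose vertical extents overlap share a row."""
    rows = []
    for cell in sorted(cells, key=lambda c: (c["y"], c["x"])):
        for row in rows:
            overlap = (min(row["bottom"], cell["y"] + cell["h"])
                       - max(row["top"], cell["y"]))
            if overlap > 0:
                row["cells"].append(cell)
                break
        else:
            # No existing row overlaps this cell, so it starts a new row.
            rows.append({"top": cell["y"], "bottom": cell["y"] + cell["h"],
                         "cells": [cell]})
    return [sorted(r["cells"], key=lambda c: c["x"]) for r in rows]


def match_label(text):
    """Label matching: map a cell's text to an ontology concept, or None."""
    key = text.strip().lower()
    for concept, labels in ONTOLOGY.items():
        if key in labels:
            return concept
    return None


def extract_records(xml_text):
    """Return one dict per record, keyed by the matched header concepts."""
    header = None
    records = []
    for row in group_rows(parse_cells(xml_text)):
        if header is None:
            concepts = [match_label(c["text"]) for c in row]
            if concepts and all(concepts):
                header = concepts  # first fully matched row is the header
            continue
        # Simplifying assumption: columns line up with the header by position.
        records.append({concept: cell["text"]
                        for concept, cell in zip(header, row)})
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_XML):
        print(record)
```

On the sample input this yields two records, one for each data row beneath the header. Real microfilm tables would need tolerance for skewed or merged cells and fuzzier label matching than the exact lookup used here.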