A retargetable table reader

  • Authors:
  • John H. Shamilian;Henry S. Baird;Thomas L. Wood

  • Venue:
  • ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
  • Year:
  • 1997

Abstract

We describe the architecture of a system for reading machine-printed documents in known, predefined tabular-data layout styles. In these tables, textual data are presented in record lines made up of fixed-width fields. Tables often do not rely on line-art (ruled lines) to delimit fields, and in this way differ crucially from fixed forms. Our system copes with multiple tables per page, identifies records within tables, segments records into fields, and recognizes characters within fields, constrained by field-specific contextual knowledge. Obstacles to good performance on tables include small print, tight line-spacing, poor-quality text (such as photocopies), and line-art or background patterns that touch the text. Precise skew-correction and pitch-estimation, together with high-performance OCR using neural nets, proved crucial in overcoming these obstacles. The most significant technical advances in this work appear to be algorithms for identifying and segmenting records with known layout, and the integration of these algorithms with a graphical user interface (GUI) for defining new layouts. This GUI has been ergonomically designed to make efficient and intuitive use of exemplary images, so that the skill and manual effort required to retarget the system to new table layouts are held to a minimum. The system has been applied in this way to more than 400 distinct tabular layouts. During the last three years the system has read over fifty million records with high accuracy.
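
As an illustration only, the idea of segmenting a record line into fixed-width fields and checking each field against field-specific constraints can be sketched in a few lines of Python. This is not the authors' algorithm: the paper's system works on page images, whereas this sketch assumes the line has already been converted to text on a fixed character grid. The `Field` class, the `segment_record` and `check_fields` functions, the example layout, and the regular-expression constraints are all hypothetical names and values chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One fixed-width field in a hypothetical table layout."""
    name: str
    start: int    # zero-based column where the field begins
    width: int    # number of character cells the field occupies
    pattern: str  # regex standing in for field-specific contextual knowledge


def segment_record(line, layout):
    """Slice one record line into named fields by fixed column positions."""
    # Pad short lines so every field slice has its full width.
    padded = line.ljust(max(f.start + f.width for f in layout))
    return {f.name: padded[f.start:f.start + f.width].strip() for f in layout}


def check_fields(record, layout):
    """Return the names of fields whose text violates the field's pattern."""
    return [f.name for f in layout
            if not re.fullmatch(f.pattern, record[f.name])]


# Hypothetical layout: three fields with no ruled lines between them.
LAYOUT = [
    Field("account", 0, 8, r"\d{6}"),
    Field("name", 8, 12, r"[A-Z ]+"),
    Field("amount", 20, 8, r"\d+\.\d{2}"),
]

# Build example record lines that are aligned to the layout's columns.
good_line = "{:<8}{:<12}{:>8}".format("123456", "SMITH J", "12.50")
bad_line = "{:<8}{:<12}{:>8}".format("12E456", "SMITH J", "12.50")

good_record = segment_record(good_line, LAYOUT)
bad_record = segment_record(bad_line, LAYOUT)
```

Here `check_fields(bad_record, LAYOUT)` flags the `account` field because a letter appears where only digits are allowed. In a real reader, such a constraint would more plausibly guide character recognition itself than act as a check after the fact.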