An Optimization Methodology for Document Structure Extraction on Latin Character Documents

  • Authors:
  • Jisheng Liang;Ihsin T. Phillips;Robert M. Haralick

  • Affiliations:
  • Insightful Corp., Seattle, WA;Queens College, City Univ. of New York, Flushing, NY;Graduate Center, City Univ. of New York, New York, NY

  • Venue:
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Year:
  • 2001

Abstract

In this paper, we give a formal definition of a document image structure representation, and we formulate document image structure extraction as a partitioning problem: finding an optimal solution that partitions the set of glyphs of an input document image into a hierarchical tree structure in which the entities at each level of the hierarchy have similar physical properties and compatible semantic labels. We present a unified methodology that is applicable to the construction of document structures at different hierarchical levels. An iterative, relaxation-like method is used to find a partitioning solution that maximizes the probability of the extracted structure. All the probabilities used in the partitioning process are estimated from an extensive training set of various kinds of measurements among the entities within the hierarchy. The probabilities estimated offline during training then drive all decisions in the online document structure extraction. We have implemented a text line extraction algorithm using this framework. The algorithm was evaluated on the UW-III database of some 1,600 scanned document image pages. An area-overlap measure is used to find the correspondence between the detected entities and the ground truth. For a total of 105,020 text lines, the text line extraction algorithm identifies and segments 104,773 correctly, an accuracy of 99.76 percent. The details of the algorithm are presented in this paper.
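
As an illustration of the evaluation step only, the following is a minimal sketch of how an area-overlap correspondence between detected and ground-truth text lines might be computed. The abstract does not specify the exact measure, so the box format, the function names, and the 0.9 threshold are assumptions made for this example, not the authors' definitions. The last lines check the reported accuracy from the counts given in the abstract.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of an axis-aligned box (0 if degenerate)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def count_correct(truth: List[Box], detected: List[Box],
                  threshold: float = 0.9) -> int:
    """Count ground-truth boxes matched by some detected box.

    Assumed criterion: the overlap must cover at least `threshold`
    of both the ground-truth box and the detected box, so split or
    merged lines do not count as correct.
    """
    correct = 0
    for g in truth:
        for d in detected:
            inter = overlap_area(g, d)
            if inter == 0.0:
                continue
            if inter / area(g) >= threshold and inter / area(d) >= threshold:
                correct += 1
                break
    return correct


# Toy data: the first line is detected exactly, the second is split in two.
truth = [(0, 0, 100, 10), (0, 20, 100, 30)]
detected = [(0, 0, 100, 10), (0, 20, 50, 30), (50, 20, 100, 30)]
n_correct = count_correct(truth, detected)

# Accuracy from the counts reported in the abstract.
reported_accuracy = round(100 * 104_773 / 105_020, 2)
```

Requiring the overlap to be large relative to both boxes is one simple way to penalize split and merged lines; other criteria are possible, and the measure used in the paper may differ.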