A Study on the Document Zone Content Classification Problem

  • Authors:
  • Yalin Wang;Ihsin T. Phillips;Robert M. Haralick

  • Venue:
  • DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
  • Year:
  • 2002

Abstract

A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for determining the content type of a given zone within an input document image. In our zone classification algorithm, zones are represented as feature vectors. Each feature vector consists of a set of 25 measurements of pre-defined properties. A probabilistic model, a decision tree, is used to classify each zone on the basis of its feature vector. Two methods are used to optimize the decision tree classifier and eliminate the data over-fitting problem. To enrich our probabilistic model, we incorporate context constraints for certain zones within their neighboring zones. We also model zone class context constraints as a Hidden Markov Model and use the Viterbi algorithm to obtain optimal classification results. The training, pruning and testing data sets for the algorithm include 1,600 images drawn from the UWCDROM-III document image database. With a total of 24,177 zones within the data set, the cross-validation method was used in the performance evaluation of the classifier. The classifier is able to classify each given scientific and technical document zone into one of nine classes: two text classes (font size 4-18 pt and font size 19-32 pt), math, table, halftone, map/drawing, ruling, logo, and others. A zone content classification performance evaluation protocol is proposed. Using this protocol, our algorithm achieves an accuracy of 98.45% with a mean false alarm rate of 0.50%.
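The abstract does not say how the Hidden Markov Model is parameterized, so the following is only a minimal sketch of Viterbi decoding over a sequence of zones, not the authors' implementation. It assumes that each zone comes with per-class scores (for example, from a decision tree) and that class-to-class transition probabilities between neighboring zones are known. The function name `viterbi`, the two-class label set, and all probability values are hypothetical and chosen purely for illustration.

```python
import math


def viterbi(emission_probs, transition, initial):
    """Return the most probable class sequence for a sequence of zones.

    emission_probs: list of dicts, one per zone, mapping class -> score
                    (assumed here to come from a per-zone classifier).
    transition:     dict mapping (prev_class, next_class) -> probability.
    initial:        dict mapping class -> probability for the first zone.
    """
    if not emission_probs:
        return []

    classes = list(initial)

    def log(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[c] = log-probability of the best path ending in class c.
    best = {c: log(initial[c]) + log(emission_probs[0][c]) for c in classes}
    backpointers = []

    for probs in emission_probs[1:]:
        new_best, pointers = {}, {}
        for c in classes:
            # Pick the predecessor class that maximizes the path score.
            prev = max(classes, key=lambda p: best[p] + log(transition[(p, c)]))
            new_best[c] = best[prev] + log(transition[(prev, c)]) + log(probs[c])
            pointers[c] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace the best path backwards from the best final class.
    path = [max(classes, key=lambda c: best[c])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical example: the middle zone looks slightly more like "math"
# in isolation, but context from its neighbors favors "text".
initial = {"text": 0.5, "math": 0.5}
transition = {
    ("text", "text"): 0.9, ("text", "math"): 0.1,
    ("math", "text"): 0.1, ("math", "math"): 0.9,
}
zones = [
    {"text": 0.9, "math": 0.1},
    {"text": 0.4, "math": 0.6},
    {"text": 0.9, "math": 0.1},
]
print(viterbi(zones, transition, initial))
```

In this toy setting the decoded sequence labels all three zones as text, which illustrates how context constraints can override a weak per-zone decision.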