Pixel accurate document image content extraction

  • Authors:
  • Siyuan Chen;Henry S. Baird

  • Affiliations:
  • Lehigh University, Bethlehem, PA;Lehigh University, Bethlehem, PA

  • Venue:
  • Proceedings of the 2011 ACM Symposium on Applied Computing
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Versatile algorithms for document image content extraction (DICE) were investigated in [1, 2, 3, 4]. That is, to extract the image layers that contain the contents of interests, such as handwriting, machine-print text, photographs and blank, etc. The DICE classifier based on tight ground truth data can delimit the regions of interests approximately. In this paper, taking the result of DICE classifier as the input, we extended the work by trying to completely separate the pixels of characters from the background and the other contents using image post-processing techniques and pattern recognition methods. First of all, we applied the color space analysis on the detected text regions. Then we segmented the image into regions (connected components) that contain pixels of similar colors and content labels, and generated patches containing multiple connected components that are within a selected distance to their neighbors. Finally we classified the generated patches using the structure features and DICE labels. The preliminary experiment results of the proposed model are promising.