Pixel accurate document image content extraction

Authors:
Siyuan Chen;Henry S. Baird
Affiliations:
Lehigh University, Bethlehem, PA;Lehigh University, Bethlehem, PA
Venue:
Proceedings of the 2011 ACM Symposium on Applied Computing
Year:
2011

References:
Binarization and multithresholding of document images using connectivity. CVGIP: Graphical Models and Image Processing.
Document Image Binarization Based on Texture Features. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Pattern Classification (2nd Edition).
Machine Printed Text and Handwriting Identification in Noisy Document Images. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Iterated Document Content Classification. ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 01.
Performance Evaluation and Benchmarking of Six-Page Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence.
The Convergence of Iterated Classification. DAS '08 Proceedings of the 2008 The Eighth IAPR International Workshop on Document Analysis Systems.
PixLabeler: User Interface for Pixel-Level Labeling of Elements in Document Images. ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition.
Markov Random Field Based Text Identification from Annotated Machine Printed Documents. ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition.
A framework for the assessment of text extraction algorithms on complex colour images. DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems.
An analysis of binarization ground truthing. DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems.
LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST).
Towards versatile document analysis systems. DAS '06 Proceedings of the 7th international conference on Document Analysis Systems.

Abstract

Versatile algorithms for document image content extraction (DICE) were investigated in [1, 2, 3, 4]. The goal of DICE is to extract the image layers that contain the content of interest, such as handwriting, machine-printed text, photographs, and blank space. The DICE classifier, based on tight ground-truth data, can delimit the regions of interest approximately. In this paper, taking the output of the DICE classifier as input, we extend that work by attempting to completely separate the character pixels from the background and from the other content types, using image post-processing techniques and pattern recognition methods. First, we apply color space analysis to the detected text regions. Next, we segment the image into regions (connected components) whose pixels have similar colors and content labels, and generate patches, each containing multiple connected components that lie within a selected distance of their neighbors. Finally, we classify the generated patches using structural features and DICE labels. The preliminary experimental results for the proposed model are promising.
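
The segmentation and patch-generation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the connectivity rule, the color quantization, or the distance measure, so 4-connectivity, pre-quantized color bins, and a bounding-box gap are assumptions made here for illustration. The function names (connected_components, bbox, box_gap, group_into_patches) and the max_gap parameter are hypothetical. Each pixel is represented as a (color_bin, content_label) pair, or None for background. The final patch classification step is not covered.

```python
from collections import deque


def connected_components(labels):
    """Return 4-connected components of cells that share the same non-None value.

    `labels` is a 2D list; each cell is a (color_bin, content_label) tuple
    or None for background. Each component is a list of (row, col) pairs.
    """
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    comps = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or labels[r][c] is None:
                continue
            key = labels[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            comp = []
            while queue:  # breadth-first flood fill over equal-valued neighbors
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and labels[ny][nx] == key):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            comps.append(comp)
    return comps


def bbox(comp):
    """Bounding box (top, left, bottom, right) of a component, inclusive."""
    ys = [y for y, _ in comp]
    xs = [x for _, x in comp]
    return min(ys), min(xs), max(ys), max(xs)


def box_gap(a, b):
    """Number of empty pixels between two boxes (Chebyshev); 0 if they touch or overlap."""
    dy = max(a[0] - b[2] - 1, b[0] - a[2] - 1, 0)
    dx = max(a[1] - b[3] - 1, b[1] - a[3] - 1, 0)
    return max(dy, dx)


def group_into_patches(comps, max_gap):
    """Merge components whose bounding boxes lie within max_gap pixels (assumed rule).

    Uses union-find, so grouping is transitive. Returns sorted lists of
    component indices, one list per patch.
    """
    parent = list(range(len(comps)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    boxes = [bbox(comp) for comp in comps]
    for i in range(len(comps)):
        for j in range(i + 1, len(comps)):
            if box_gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    patches = {}
    for i in range(len(comps)):
        patches.setdefault(find(i), []).append(i)
    return sorted(patches.values())


# Toy input: three dark "text" strokes on a background of None cells.
T = ("dark", "text")
grid = [
    [T, T, None, T, None, None, None, None, T],
    [T, None, None, T, None, None, None, None, T],
]
components = connected_components(grid)
patches = group_into_patches(components, max_gap=2)
```

In this toy grid the first two strokes are one pixel apart and the third is four pixels further away, so a max_gap of 2 groups the first two strokes into one patch and leaves the third on its own. A real system would choose the distance threshold and the color quantization from the data.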