Binarization and multithresholding of document images using connectivity
CVGIP: Graphical Models and Image Processing
Document Image Binarization Based on Texture Features
IEEE Transactions on Pattern Analysis and Machine Intelligence
Pattern Classification (2nd Edition)
Pattern Classification (2nd Edition)
Machine Printed Text and Handwriting Identification in Noisy Document Images
IEEE Transactions on Pattern Analysis and Machine Intelligence
Iterated Document Content Classification
ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 01
Performance Evaluation and Benchmarking of Six-Page Segmentation Algorithms
IEEE Transactions on Pattern Analysis and Machine Intelligence
The Convergence of Iterated Classification
DAS '08 Proceedings of the 2008 The Eighth IAPR International Workshop on Document Analysis Systems
PixLabeler: User Interface for Pixel-Level Labeling of Elements in Document Images
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Markov Random Field Based Text Identification from Annotated Machine Printed Documents
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
A framework for the assessment of text extraction algorithms on complex colour images
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
An analysis of binarization ground truthing
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)
Towards versatile document analysis systems
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Versatile algorithms for document image content extraction (DICE) were investigated in [1, 2, 3, 4]. Their goal is to extract the image layers that contain the content of interest, such as handwriting, machine-printed text, photographs, and blank space. A DICE classifier based on tight ground-truth data can delimit the regions of interest approximately. In this paper, taking the output of the DICE classifier as input, we extend that work by attempting to completely separate character pixels from the background and from other content, using image post-processing techniques and pattern recognition methods. First, we apply color space analysis to the detected text regions. Next, we segment the image into regions (connected components) whose pixels have similar colors and content labels, and generate patches, each containing multiple connected components that lie within a selected distance of their neighbors. Finally, we classify the generated patches using structural features and DICE labels. Preliminary experimental results for the proposed model are promising.
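The segmentation and patch-generation steps can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each pixel has already been reduced to one integer id that combines its color cluster and DICE content label (0 for background), uses 4-connectivity, measures distance as the gap between bounding boxes, and merges components regardless of label. The names `connected_components`, `bbox_gap`, `group_into_patches`, and `max_gap` are hypothetical, and the final patch classification step is not covered.

```python
from collections import deque


def connected_components(labels):
    """Find 4-connected regions of cells that share the same non-zero id."""
    h, w = len(labels), len(labels[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if labels[r][c] == 0 or seen[r][c]:
                continue
            lab = labels[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and labels[ny][nx] == lab):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            # bbox is (top, left, bottom, right), all inclusive
            comps.append({"label": lab, "pixels": pixels,
                          "bbox": (min(ys), min(xs), max(ys), max(xs))})
    return comps


def bbox_gap(a, b):
    """Coordinate gap between two bounding boxes (0 if they overlap)."""
    dy = max(a[0] - b[2], b[0] - a[2], 0)
    dx = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dy, dx)


def group_into_patches(comps, max_gap):
    """Union components whose bounding boxes lie within max_gap of each other."""
    parent = list(range(len(comps)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(comps)):
        for j in range(i + 1, len(comps)):
            if bbox_gap(comps[i]["bbox"], comps[j]["bbox"]) <= max_gap:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(comps)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy label map: two nearby strokes with id 1 and a distant stroke with id 2.
grid = [
    [1, 1, 0, 1, 0, 0, 0, 0, 2],
    [1, 0, 0, 1, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
comps = connected_components(grid)
patches = group_into_patches(comps, max_gap=2)
print(patches)
```

Each patch is a list of component indices. In this toy input, the two nearby strokes fall into one patch and the distant stroke forms its own. A larger `max_gap` merges more components into a patch, so in practice the threshold would need tuning to the expected character spacing.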