A histogram-based technique for automatic threshold assessment in a run length smoothing-based algorithm

Authors:
Stefano Ferilli;Teresa M. A. Basile;Floriana Esposito
Affiliations:
Università degli Studi di Bari, Bari (BA) -- Italia;Università degli Studi di Bari, Bari (BA) -- Italia;Università degli Studi di Bari, Bari (BA) -- Italia
Venue:
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
Year:
2010

Abstract

Document layout analysis is a crucial step in the automatic document processing workflow, because its outcome affects all subsequent processing steps. A first problem is the need to handle not only documents with a simple layout, but also so-called non-Manhattan layouts. Another problem is that most available techniques can be applied only to scanned documents, because in previous decades the emphasis was on digitizing legacy documents. Nowadays, conversely, most documents are produced directly in digital format, so new techniques must be developed. A well-known approach to layout analysis proposed in the literature is RLSA, which is suited to scanned black-and-white images and is based on the application of Run Length Smoothing and the logical AND operator. A recent variant is based on the OR operator instead, and for this reason has been called RLSO. It takes a bottom-up approach that has proved able to handle even non-Manhattan layouts, on both scanned and natively digital documents. Like RLSA, it relies on thresholds for the smoothing operator, but because the approach differs, the criteria used to choose suitable values in RLSA do not carry over. Since setting them is a hard and unnatural task even for an expert user, this paper proposes a technique that automatically determines these thresholds for each individual document, based on the distribution of spacing within it. Application to a selected sample of documents, chosen to cover a significant range of real cases, showed that the approach is satisfactory for documents that use a uniform text font size. It can also provide a useful basis for handling more complex cases.
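
To make the contrast between the AND-based and OR-based combinations concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a binary image stored as a list of rows with 1 for black and 0 for white, and it smooths only white runs enclosed between two black pixels. The function names are hypothetical, and `rlsa` covers only the simplified core (horizontal and vertical smoothing combined with AND), omitting any further steps of the published algorithm. The abstract does not describe the paper's histogram-based criterion, so `threshold_from_histogram` uses a stand-in heuristic (the most frequent white run length), purely to illustrate deriving a threshold from the spacing distribution of a single document.

```python
from collections import Counter


def smooth_runs(line, threshold):
    """Fill white runs (0s) of length <= threshold lying between two black pixels (1s)."""
    out = list(line)
    last_black = None
    for i, px in enumerate(line):
        if px == 1:
            if last_black is not None and 0 < i - last_black - 1 <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def smooth_horizontal(img, threshold):
    return [smooth_runs(row, threshold) for row in img]


def smooth_vertical(img, threshold):
    # Smooth each column, then transpose back to row-major order.
    cols = [smooth_runs(list(col), threshold) for col in zip(*img)]
    return [list(row) for row in zip(*cols)]


def rlsa(img, th, tv):
    """Simplified RLSA core: AND of the horizontal and vertical smoothings."""
    h, v = smooth_horizontal(img, th), smooth_vertical(img, tv)
    return [[a & b for a, b in zip(rh, rv)] for rh, rv in zip(h, v)]


def rlso(img, th, tv):
    """RLSO-style combination: OR of the horizontal and vertical smoothings."""
    h, v = smooth_horizontal(img, th), smooth_vertical(img, tv)
    return [[a | b for a, b in zip(rh, rv)] for rh, rv in zip(h, v)]


def white_run_histogram(img):
    """Histogram of horizontal white run lengths enclosed between black pixels."""
    hist = Counter()
    for row in img:
        last_black = None
        for i, px in enumerate(row):
            if px == 1:
                if last_black is not None and i - last_black > 1:
                    hist[i - last_black - 1] += 1
                last_black = i
    return hist


def threshold_from_histogram(hist):
    """ASSUMPTION: stand-in heuristic, not the paper's criterion.

    Returns the most frequent run length, preferring the shorter one on ties.
    """
    if not hist:
        return 0
    return max(hist.items(), key=lambda kv: (kv[1], -kv[0]))[0]


img = [
    [1, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 1],
]
t = threshold_from_histogram(white_run_histogram(img))
merged = rlso(img, t, t)
```

On this toy image the OR combination keeps every pixel filled by either smoothing direction, while the AND combination keeps only pixels filled by both. The two variants therefore react differently to the same threshold values, which is consistent with the abstract's remark that the criteria used for RLSA do not carry over to RLSO.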