A histogram-based technique for automatic threshold assessment in a run length smoothing-based algorithm

  • Authors:
  • Stefano Ferilli;Teresa M. A. Basile;Floriana Esposito

  • Affiliations:
  • Università degli Studi di Bari, Bari (BA) -- Italia;Università degli Studi di Bari, Bari (BA) -- Italia;Università degli Studi di Bari, Bari (BA) -- Italia

  • Venue:
  • DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
  • Year:
  • 2010

Abstract

Document layout analysis is crucial in the automatic document processing workflow, because its outcome affects all subsequent processing steps. A first problem concerns the possibility of dealing not only with documents having a simple layout, but with so-called non-Manhattan layout documents as well. Another problem is that most available techniques can be applied only to scanned documents, because in previous decades the emphasis was on the digitization of legacy documents. Conversely, nowadays most documents come directly in digital format, and thus new techniques must be developed. A famous approach proposed in the literature for layout analysis is RLSA, which is suitable for scanned black-and-white images and is based on the application of Run Length Smoothing and the AND logical operator. A recent variant thereof is based on the application of the OR operator, and for this reason has been called RLSO. It exploits a bottom-up approach that proved able to handle even non-Manhattan layouts, on both scanned and natively digital documents. Like RLSA, it is based on the definition of thresholds for the smoothing operator, but its different approach requires different criteria from those used in RLSA to define proper values. Since this is a hard and unnatural task for a user, even an expert one, this paper proposes a technique to automatically define such thresholds for each single document, based on the distribution of spacing therein. Application to selected samples of documents, chosen to cover a significant range of real cases, showed that the approach is satisfactory for documents characterized by a uniform text font size. It can also provide a useful basis for handling more complex cases.
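
To make the terms in the abstract concrete, the following is a minimal Python sketch of run length smoothing on a binary image (1 = black, 0 = white), with the horizontal and vertical results combined by AND (as in RLSA) or OR (as in RLSO). It is an illustration under stated assumptions, not the authors' implementation: the function names, the convention that only white runs bounded by black on both sides are filled, and the `white_run_histogram` helper are all hypothetical. The helper only collects the distribution of horizontal spacings; the rule the paper uses to turn that distribution into thresholds is not given in the abstract and is not reproduced here.

```python
from collections import Counter


def smooth_row(row, threshold):
    """Fill white runs (0s) of length <= threshold lying between two black pixels (1s).

    Assumption: runs touching the row border are left untouched.
    """
    out = list(row)
    last_black = None
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= threshold:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def smooth_horizontal(image, threshold):
    return [smooth_row(row, threshold) for row in image]


def smooth_vertical(image, threshold):
    # Transpose, smooth each column as if it were a row, transpose back.
    columns = [list(col) for col in zip(*image)]
    smoothed = [smooth_row(col, threshold) for col in columns]
    return [list(row) for row in zip(*smoothed)]


def combine(a, b, op):
    """Apply a pixel-wise binary operator to two images of equal size."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rlsa(image, th_h, th_v):
    """Horizontal and vertical smoothing combined with AND."""
    return combine(smooth_horizontal(image, th_h),
                   smooth_vertical(image, th_v),
                   lambda x, y: x & y)


def rlso(image, th_h, th_v):
    """Horizontal and vertical smoothing combined with OR."""
    return combine(smooth_horizontal(image, th_h),
                   smooth_vertical(image, th_v),
                   lambda x, y: x | y)


def white_run_histogram(image):
    """Count horizontal white runs bounded by black pixels, keyed by run length."""
    hist = Counter()
    for row in image:
        last_black = None
        for i, pixel in enumerate(row):
            if pixel == 1:
                if last_black is not None and i - last_black > 1:
                    hist[i - last_black - 1] += 1
                last_black = i
    return hist


image = [
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
]
merged = rlso(image, th_h=2, th_v=1)
spacing = white_run_histogram(image)
```

In this toy image, the horizontal threshold of 2 closes the short gaps but not the long ones, and the histogram shows the two spacing lengths that a threshold would have to separate. Because OR keeps every pixel set by either pass while AND keeps only those set by both, the same thresholds behave differently under the two operators, which is consistent with the abstract's remark that RLSO needs different criteria than RLSA.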