Segmentation of historical machine-printed documents using Adaptive Run Length Smoothing and skeleton segmentation paths

Authors:
Nikos Nikolaou;Michael Makridis;Basilis Gatos;Nikolaos Stamatopoulos;Nikos Papamarkos
Affiliations:
Department of Electrical and Computer Engineering, Democritus University of Thrace, 67 100 Xanthi, Greece and Computational Intelligence Laboratory, Institute of Informatics and Telecommunications ...;Department of Electrical and Computer Engineering, Democritus University of Thrace, 67 100 Xanthi, Greece;Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research "Demokritos", 153 10 Athens, Greece;Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research "Demokritos", 153 10 Athens, Greece;Department of Electrical and Computer Engineering, Democritus University of Thrace, 67 100 Xanthi, Greece
Venue:
Image and Vision Computing
Year:
2010

Abstract

In this paper, we develop efficient techniques for segmenting document pages that result from the digitization of historical machine-printed sources. Documents of this kind often suffer from low quality and local skew, show several degradations due to the quality of the old printing matrix or to ink diffusion, and exhibit complex and dense layouts. To address these problems, we introduce the following innovations: (i) a novel Adaptive Run Length Smoothing Algorithm (ARLSA) to handle complex and dense document layouts, (ii) detection of noisy areas and punctuation marks, which are common in historical machine-printed documents, (iii) detection of possible obstacles formed from background areas in order to separate neighboring text columns or text lines, and (iv) use of skeleton segmentation paths to isolate characters that may be connected. Comparative experiments on several historical machine-printed documents demonstrate the efficiency of the proposed technique.
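
The abstract does not spell out how ARLSA works, so the following is only a minimal sketch of the classic fixed-threshold horizontal run length smoothing step that adaptive variants build on. It is not the authors' algorithm. The function names, the `max_gap` parameter, and the convention that 1 marks ink and 0 marks background are all assumptions made for illustration.

```python
def rlsa_horizontal(row, max_gap):
    """Classic fixed-threshold run length smoothing on one binary row.

    row: list of 0/1 values (assumed convention: 1 = ink, 0 = background).
    max_gap: hypothetical threshold; a background run of at most this
             length that lies between two ink pixels is filled with ink.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, value in enumerate(row):
        if value == 1:
            if last_ink is not None:
                gap = i - last_ink - 1  # background pixels between the two ink pixels
                if 0 < gap <= max_gap:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def rlsa_image(image, max_gap):
    """Apply the horizontal smoothing independently to every row of a binary image."""
    return [rlsa_horizontal(row, max_gap) for row in image]


if __name__ == "__main__":
    row = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
    print(rlsa_horizontal(row, max_gap=2))
    # prints [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

With a single global threshold, the short gap between nearby characters is closed while the wider gap is left open. In a dense layout a fixed threshold can also merge neighboring columns or lines, which is the kind of problem an adaptive threshold is meant to address.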