Image processing for historical newspaper archives

Authors:
Takahiro Shima;Kengo Terasawa;Toshio Kawashima
Affiliations:
Renesas Micro Systems Co., Ltd., Sapporo, Hokkaido, Japan;Future University Hakodate, Hakodate, Hokkaido, Japan;Future University Hakodate, Hakodate, Hokkaido, Japan
Venue:
Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
Year:
2011

Abstract

This paper presents image processing methods that can produce accurate character segmentation results for historical newspaper archives. Full-text search using word spotting is a promising way to make digital archives easier to use. Some word spotting techniques require the target images to be segmented into character images in advance; however, character segmentation is difficult, especially for old and degraded document images. This paper identifies the causes that make character segmentation difficult and removes them in order to improve segmentation accuracy. We first detect ruled lines using the Hough transform in order to segment a whole newspaper image into column-separated images. We then remove the ruled lines, as well as ruby characters and noise. The proposed system was tested on 20 column-separated images of historical newspapers, and the accuracy of character segmentation improved to 96.3%.
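
The ruled-line step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binarized page stored as a list of rows of 0/1 values, restricts the Hough transform to two angles (0 and 90 degrees), and uses hypothetical helper names (`hough_lines`, `split_at_horizontal_lines`) and an arbitrary vote threshold. Each ink pixel votes for the lines rho = x cos(theta) + y sin(theta) passing through it, and accumulator cells with enough votes are treated as ruled lines. The page is then cut into bands between the detected horizontal lines, which also discards the line rows themselves.

```python
import math
from collections import Counter


def hough_lines(binary, thetas_deg=(0, 90), min_votes=5):
    """Return sorted (rho, theta_deg, votes) for lines with >= min_votes ink pixels.

    Assumption: only the angles in thetas_deg are searched, which is enough
    for axis-aligned ruled lines but not for skewed scans.
    """
    votes = Counter()
    for y, row in enumerate(binary):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            for theta in thetas_deg:
                t = math.radians(theta)
                # Normal form of a line: rho = x*cos(theta) + y*sin(theta)
                rho = round(x * math.cos(t) + y * math.sin(t))
                votes[(rho, theta)] += 1
    return sorted(
        (rho, theta, n) for (rho, theta), n in votes.items() if n >= min_votes
    )


def split_at_horizontal_lines(binary, lines):
    """Cut the image into bands between detected horizontal lines (theta == 90)."""
    cuts = sorted(rho for rho, theta, _ in lines if theta == 90)
    bands, start = [], 0
    for y in cuts + [len(binary)]:
        if y > start:
            bands.append(binary[start:y])
        start = y + 1  # skip the ruled-line row itself
    return bands


# Toy 7x8 page: one full-width ruled line on row 3, a few "character" pixels.
page = [[0] * 8 for _ in range(7)]
page[3] = [1] * 8
page[1][1] = page[1][2] = 1
page[5][4] = page[5][5] = 1

lines = hough_lines(page, min_votes=6)
bands = split_at_horizontal_lines(page, lines)
print(lines)
print([len(band) for band in bands])
```

On a real scan the threshold would have to scale with the image size, and a small range of angles around 0 and 90 degrees would be needed to tolerate skew; both are left out here to keep the sketch short.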