Google Newspaper Search - Image Processing and Analysis Pipeline

Authors:
Krishnendu Chaudhury; Ankur Jain; Sriram Thirthala; Vivek Sahasranaman; Shobhit Saxena; Selvam Mahalingam
Venue:
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Year:
2009

Abstract

The Google Newspaper Search program was launched on September 8, 2008. In this paper, we outline the technology underlying this large and complex project. We have created a production pipeline that takes newspaper microfilms as input and emits individual news articles as output. These articles are then indexed and added to the content base, so that they turn up in response to Google searches. Thus, in response to the Google query “Hitler death”, we are able to show newspaper articles from the very day the event was reported, authentic and unbiased by the passage of time. Non-uniform illumination, significant noise, and tears and scratches in the microfilm images all pose special challenges for this project. The significant variation in layouts across newspapers and eras, and the variation in font sizes within a single page (which confuses the OCR engine), compound the difficulties. The project is ongoing; the initial launch included about 15 million news articles.