Google Newspaper Search - Image Processing and Analysis Pipeline

  • Authors:
  • Krishnendu Chaudhury; Ankur Jain; Sriram Thirthala; Vivek Sahasranaman; Shobhit Saxena; Selvam Mahalingam

  • Venue:
  • ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
  • Year:
  • 2009

Abstract

The Google Newspaper Search program was launched on September 8, 2008. In this paper, we outline the technology underlying this large and complex project. We have created a production pipeline that takes newspaper microfilms as input and emits individual news articles as output. These articles are then indexed and added to the content base, so that they turn up in response to Google searches. Thus, in response to the Google query “Hitler death”, we are able to show newspaper articles from the very day the event was reported, authentic and unbiased by the passage of time. Non-uniform illumination, significant noise, and tears and scratches in the microfilm images all pose special challenges for this project. The significant variation in layouts across newspapers and eras, and the variation in font sizes within a single page (which confuses the OCR engine), compound the difficulties. The project is ongoing; the initial launch included about 15 million news articles.