Transforming Japanese archives into accessible digital books
Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
Quality assurance for document image collections in digital preservation
ACIVS'12 Proceedings of the 14th international conference on Advanced Concepts for Intelligent Vision Systems
The Google Newspaper Search program was launched on September 8, 2008. In this paper, we outline the technology underlying this large and complex project. We have built a production pipeline that takes newspaper microfilm as input and emits individual news articles as output. These articles are then indexed and added to the content base so that they appear in response to Google searches. Thus, for a Google query such as “Hitler death”, we can show newspaper articles from the very day the event was reported, authentic and unbiased by the passage of time. Non-uniform illumination, significant noise, and tears and scratches in the microfilm images all pose special challenges for this project. The significant variation in layout across newspapers and eras, and the variation in font sizes within a single page (which confuses the OCR engine), compound the difficulties. The project has continued since the initial launch, which included about 15 million news articles.