Document digitization lifecycle for complex magazine collection

  • Authors:
  • Sherif Yacoub;John Burns;Paolo Faraboschi;Daniel Ortega;Jose Abad Peiro;Vinay Saxena

  • Affiliations:
  • Hewlett-Packard, Spain;Hewlett-Packard, Spain;Hewlett-Packard, Spain;Hewlett-Packard, Spain;Hewlett-Packard, Spain;Hewlett-Packard, USA

  • Venue:
  • Proceedings of the 2005 ACM symposium on Document engineering
  • Year:
  • 2005

Abstract

The conversion of large collections of documents from paper to digital formats that are suitable for electronic archival is a complex multi-phase process. The creation of good quality images from paper documents is just one phase. To extract the relevant information these documents contain, with an accuracy that fits the purpose of target applications, an automated document analysis system and a manual verification/review process are needed. The automated system needs to perform a variety of analysis and recognition tasks in order to reach an accuracy level that minimizes the manual correction effort downstream.

This paper describes the complete process and the associated technologies, tools, and systems needed for the conversion of a large collection of complex documents and its deployment for online web access to its information-rich content. We used this process to recapture 80 years of Time magazine. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted into a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy.