Document digitization lifecycle for complex magazine collection

Authors:
Sherif Yacoub; John Burns; Paolo Faraboschi; Daniel Ortega; Jose Abad Peiro; Vinay Saxena
Affiliations:
Hewlett-Packard, Spain; Hewlett-Packard, Spain; Hewlett-Packard, Spain; Hewlett-Packard, Spain; Hewlett-Packard, Spain; Hewlett-Packard, USA
Venue:
Proceedings of the 2005 ACM symposium on Document engineering
Year:
2005

Abstract

The conversion of large collections of documents from paper to digital formats suitable for electronic archival is a complex, multi-phase process. The creation of good-quality images from paper documents is just one phase. To extract the relevant information they contain, with an accuracy that fits the purpose of the target applications, an automated document analysis system and a manual verification/review process are needed. The automated system needs to perform a variety of analysis and recognition tasks in order to reach an accuracy level that minimizes the manual correction effort downstream.

This paper describes the complete process and the associated technologies, tools, and systems needed for the conversion of a large collection of complex documents and its deployment for online web access to its information-rich content. We used this process to recapture 80 years of Time magazines. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted into a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy.