The conversion of large collections of documents from paper to digital formats that are suitable for electronic archival is a complex multi-phase process. Creating good-quality images from the paper documents is just one phase. Extracting the relevant information they contain, with an accuracy that fits the purpose of the target applications, requires both an automated document analysis system and a manual verification/review process. The automated system must perform a variety of analysis and recognition tasks to reach an accuracy level that minimizes the manual correction effort downstream.

This paper describes the complete process and the associated technologies, tools, and systems needed to convert a large collection of complex documents and deploy its information-rich content for online web access. We used this process to recapture 80 years of Time magazine. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted into a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy.