The IMPACT dataset of historical document images

Authors:
Christos Papadopoulos;Stefan Pletschacher;Christian Clausner;Apostolos Antonacopoulos
Affiliations:
University of Salford, Greater Manchester, United Kingdom;University of Salford, Greater Manchester, United Kingdom;University of Salford, Greater Manchester, United Kingdom;University of Salford, Greater Manchester, United Kingdom
Venue:
Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing
Year:
2013

Abstract

Representative and comprehensive datasets are a prerequisite for any research activity, from studying specific types of problems through training of algorithms to evaluating results of actual implementations. This paper describes an invaluable resource that is the result of a large-scale effort undertaken in the EU-funded project IMPACT. It describes a number of challenges faced during the creation phase, as well as the significant benefits and potential of this collection of printed historical documents. The dataset contains over 600,000 document images that originate from major European libraries and are representative of both their respective holdings and their digitisation plans for the near to medium term. It is unique in the very substantial amount of high-quality ground truth available, which covers approximately 45,000 pages and captures detailed layout, reading order and text content. The dataset is publicly available through the IMPACT Centre of Competence (www.digitisation.eu).