IMPACT: centre of competence in text digitisation

  • Authors:
  • Hildelies Balk; Aly Conteh

  • Affiliations:
  • KB National Library of European Projects/director IMPACT Project, The Hague, The Netherlands; The British Library, London, United Kingdom

  • Venue:
  • Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
  • Year:
  • 2011

Abstract

A major focus of recent large-scale digitisation initiatives has been historical texts, primarily out-of-copyright newspapers and books. However, the Optical Character Recognition (OCR) software used to convert the scanned images into machine-readable text does not provide satisfactory results for historical documents. This is due to issues inherent in the material, such as warped pages, bleed-through, historical fonts, broken and irregular characters, complex layouts, and spelling variants. In the large-scale project Improving Access to Text (IMPACT), a European team of scientists, industry partners and digitisation professionals has been working together to enhance existing approaches, and develop new ones, for extracting text content from historical documents. The project facilitates a successful collaboration between digitisation professionals, based at institutions digitising millions of historical text documents, and scientists in document analysis, language technologies and OCR. This session will detail the work of IMPACT in the context of real-life problems faced in the large-scale digitisation programmes of libraries. It will also describe the legacy that the project will leave to foster further research in advancing the state of the art in extracting textual content from historical documents.