Creation of textual versions of historical documents from Polish digital libraries

  • Authors:
  • Adam Dudczak; Miłosz Kmieciak; Marcin Werla

  • Affiliations:
  • Poznań Supercomputing and Networking Center, Poznań, Poland (all authors)

  • Venue:
  • TPDL'12: Proceedings of the Second International Conference on Theory and Practice of Digital Libraries
  • Year:
  • 2012

Abstract

This paper describes the results of initial work aimed at increasing the number and improving the quality of textual versions of the historical documents available in Polish digital libraries. The digital library community lacks tools that integrate the existing digitisation workflow with a customizable OCR engine and crowd-based text correction; this paper describes work on providing such a solution. Apart from a review of the current state of the art in this field, the paper includes a description of the Virtual Transcription Laboratory (VTL) prototype, a crowdsourcing platform that utilizes the Tesseract OCR engine. The last chapter outlines the results of the prototype's evaluation on a real-life dataset of historical documents from the IMPACT project. The results demonstrate the applicability of the proposed solution as an enhancement of the digitisation workflow.