Digital weight watching: reconstruction of scanned documents

  • Authors:
  • Tim Gielissen; Maarten Marx

  • Affiliations:
  • University of Amsterdam, Amsterdam, The Netherlands; University of Amsterdam, Amsterdam, The Netherlands

  • Venue:
  • Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
  • Year:
  • 2009

Abstract

Scanned and OCRed data lead to large file sizes when facsimile images are included. This makes storing large data sets, and providing online access to them, costly. Manual analysis of such data is also cumbersome because of long download and processing times. It can therefore be advantageous to reconstruct the scanned documents as image-free documents that nevertheless closely resemble the originals. We performed this reconstruction for a data set of Dutch parliamentary proceedings, with positive results: the reconstructed documents needed only 1.5% of the original storage space while resembling the originals to a high degree. We describe the reconstruction process and evaluate its costs, benefits, and quality.