Document engineering for a digital library: PDF recompression using JBIG2 and other optimizations of PDF documents

  • Authors:
  • Petr Sojka; Radim Hatlapatka

  • Affiliations:
  • Masaryk University, Brno, Czech Republic; Masaryk University, Brno, Czech Republic

  • Venue:
  • Proceedings of the 10th ACM symposium on Document engineering
  • Year:
  • 2010

Abstract

This paper describes several innovative document transformations and tools that have been developed in the process of building the Digital Mathematical Library DML-CZ (http://dml.cz). The main result presented in this paper is a set of PDF re-compression tools built on the jbig2enc library. Together with other programs, especially pdfsizeopt.py by Péter Szabó, we have managed to decrease PDF storage size and transmission needs by 62%: using both programs, we reduced the PDFs to 38% of their original size. This paper also briefly describes other approaches and tools developed while creating the digital library. The batch digital signature stamper, the document similarity metrics, which use four different methods, a [meta]data validation process, and some math OCR tools are among the main byproducts of this project. These document engineering techniques, together with Google Scholar indexing optimizations, have led to the success of serving digitized and born-digital scientific math documents to the public in DML-CZ, and will also be employed in the European Digital Mathematics Library project, EuDML.