Reconstructing corrupt DEFLATEd files

  • Authors: Ralf D. Brown
  • Affiliations: Carnegie Mellon University Language Technologies Institute, 5000 Forbes Avenue, Pittsburgh PA 15213, USA
  • Venue: Digital Investigation: The International Journal of Digital Forensics & Incident Response
  • Year: 2011

Abstract

We present a method for determining a synchronization point within a DEFLATE-compressed bit stream (as used in Zip and gzip archives) whose beginning is unknown or damaged. Decompressing from the synchronization point forward yields a mixed stream of literal bytes and co-indexed unknown bytes. Language modeling, in the form of byte trigrams and word unigrams, is then applied to the resulting stream to infer probable replacements for each co-indexed unknown byte. Unique inferences can be made for approximately 30% of the co-indices, permitting reconstruction of approximately 75% of the unknown bytes recovered from the compressed data, with accuracy in excess of 90%. The program implementing these techniques is available as open-source software.
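
The inference step can be illustrated with a toy sketch. It is not the paper's implementation: it uses byte trigrams only (no word unigrams), add-one smoothing, and a tiny made-up training corpus. It also assumes that "co-indexed" means every unknown byte carrying the same label stands for the same original byte. The stream representation (integers for known bytes, string labels such as `u0` for co-indices) and the names `train_trigrams`, `infer_unknowns`, and `make_stream` are hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_trigrams(corpus: bytes) -> Counter:
    """Count every overlapping byte trigram in the training corpus."""
    return Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))


def infer_unknowns(stream, counts, alphabet):
    """Pick the most likely byte for each co-index label.

    stream   -- list whose items are ints (known bytes) or str labels
                (co-indexed unknown bytes; same label = same byte)
    counts   -- trigram counts from train_trigrams
    alphabet -- candidate byte values to try for each label
    """
    total = sum(counts.values())
    vocab = 256 ** 3  # number of possible byte trigrams (add-one smoothing)

    # Collect the positions at which each label occurs.
    positions = defaultdict(list)
    for i, item in enumerate(stream):
        if isinstance(item, str):
            positions[item].append(i)

    result = {}
    for label, where in positions.items():
        # Every 3-byte window that covers at least one occurrence of label.
        starts = {s for p in where for s in range(p - 2, p + 1)
                  if s >= 0 and s + 3 <= len(stream)}
        best, best_score = None, -math.inf
        for cand in alphabet:
            score = 0.0
            for s in sorted(starts):
                window = [cand if x == label else x for x in stream[s:s + 3]]
                if any(isinstance(x, str) for x in window):
                    continue  # window touches a different unknown; skip it
                score += math.log((counts[bytes(window)] + 1) / (total + vocab))
            if score > best_score:
                best, best_score = cand, score
        result[label] = best
    return result


def make_stream(text: str, hidden: dict) -> list:
    """Hypothetical helper: hide selected characters behind co-index labels."""
    return [hidden.get(ch, ord(ch)) for ch in text]


if __name__ == "__main__":
    # Toy training text, invented for this example.
    corpus = b"a cat sat on a mat; the rat had a hat and the man ran to the van "
    stream = make_stream("the cat sat on the mat", {"a": "u0", "n": "u1"})
    guesses = infer_unknowns(stream, train_trigrams(corpus), bytes(range(32, 127)))
    print({label: chr(value) for label, value in guesses.items()})
```

Scoring each candidate across all occurrences of a label at once is what makes co-indexing useful: one label seen in several contexts accumulates evidence from every surrounding trigram. This sketch always returns its single best guess and does not attempt to reproduce the uniqueness criterion or the accuracy figures reported in the abstract.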