Compressing dynamic text collections via phrase-based coding

  • Authors: Nieves R. Brisaboa; Antonio Fariña; Gonzalo Navarro; José R. Paramá

  • Affiliations: Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain; Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain; Dept. of Computer Science, Univ. de Chile, Santiago, Chile; Database Lab., Univ. da Coruña, Facultade de Informática, A Coruña, Spain

  • Venue: ECDL'05 Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries
  • Year: 2005

Abstract

We present a new statistical compression method, Phrase Based Dense Code (PBDC), aimed at compressing large digital libraries. PBDC compresses a text collection to 30–32% of its original size, allows the text to remain compressed at all times, and supports efficient online information retrieval. The novelty of PBDC is that it supports continuous growth of the compressed text collection by automatically adapting the vocabulary both to new words and to changes in the word frequency distribution, without degrading the compression ratio. Text compressed with PBDC can be searched directly, without decompression, using fast Boyer-Moore-type algorithms. Arbitrary portions of the collection can also be decompressed. Alternative compression methods oriented to information retrieval focus on static collections and are therefore less well suited to digital libraries.
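
The abstract does not specify PBDC's codeword layout, so the following is only an illustrative sketch of why a byte-oriented dense code can be searched without decompression. It assumes a static, word-based End-Tagged Dense Code, a simpler scheme than PBDC: it has no phrases and no dynamic vocabulary adaptation. The function names (`etdc_codeword`, `build_vocabulary`, `compress`, `find_word`) are hypothetical, and Python's built-in `bytes.find` stands in for a Boyer-Moore-type search.

```python
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """Codeword for the symbol of the given frequency rank (0 = most frequent).

    Assumption: End-Tagged Dense Code layout. The last byte of each codeword
    has its high bit set; all earlier bytes have it clear.
    """
    out = [128 + rank % 128]       # final byte carries the end tag
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)     # continuation bytes stay below 128
        rank = rank // 128 - 1
    return bytes(reversed(out))


def build_vocabulary(words):
    """Map each word to a codeword; more frequent words get shorter codes."""
    counts = Counter(words)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: etdc_codeword(i) for i, w in enumerate(ordered)}


def compress(words, vocab):
    """Concatenate the codewords of the words, in text order."""
    return b"".join(vocab[w] for w in words)


def find_word(compressed: bytes, codeword: bytes):
    """Byte offsets where `codeword` occurs aligned to a codeword boundary."""
    hits = []
    pos = compressed.find(codeword)
    while pos != -1:
        # A true match starts right after an end-tagged byte (or at offset 0);
        # otherwise the hit is only the tail of a longer codeword.
        if pos == 0 or compressed[pos - 1] >= 128:
            hits.append(pos)
        pos = compressed.find(codeword, pos + 1)
    return hits


words = "to be or not to be".split()
vocab = build_vocabulary(words)
data = compress(words, vocab)
hits = find_word(data, vocab["be"])
```

Because the end tag marks codeword boundaries, a plain byte-string search over `data` needs only a one-byte check to reject misaligned matches, and decoding can begin at any boundary, which is consistent with the abstract's claim that arbitrary portions of the collection can be decompressed.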