Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval

  • Authors:
  • Shanika Kuruppu; Simon J. Puglisi; Justin Zobel

  • Affiliations:
  • National ICT Australia, Department of Computer Science & Software Engineering, University of Melbourne; School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, Australia; National ICT Australia, Department of Computer Science & Software Engineering, University of Melbourne

  • Venue:
  • SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
  • Year:
  • 2010

Abstract

Self-indexes - data structures that simultaneously provide fast search of and access to compressed text - are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our 'RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of $r$ sequences totaling $N$ bases, with a total of $s$ point mutations from a base sequence of length $n$, this representation requires just $nH_k(T) + s \log n + s \log \frac{N}{s} + O(s)$ bits. At the cost of negligible extra space, access to $\ell$ consecutive symbols requires $O(\ell + \log n)$ time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
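
The idea of encoding a sequence relative to a base can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`rlz_factorize`, `rlz_decode`, `factor_starts`, `rlz_access`) are hypothetical, the factor format (position, length) with a length-0 literal is an assumption, and a naive quadratic scan of the base stands in for the self-index that the abstract describes. Substring access here uses a binary search over factor start offsets, which only loosely mirrors the access bound stated above.

```python
from bisect import bisect_right
from itertools import accumulate


def rlz_factorize(base, seq):
    """Greedily parse seq into factors relative to base.

    Each factor is (position_in_base, length). A symbol that does not
    occur in base is stored as the literal factor (symbol, 0).
    """
    factors = []
    i = 0
    while i < len(seq):
        best_pos, best_len = -1, 0
        # Naive search over every start in base; an assumption made for
        # brevity. A practical version would query an index of the base.
        for p in range(len(base)):
            m = 0
            while (p + m < len(base) and i + m < len(seq)
                   and base[p + m] == seq[i + m]):
                m += 1
            if m > best_len:
                best_pos, best_len = p, m
        if best_len == 0:
            factors.append((seq[i], 0))   # literal symbol
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_decode(base, factors):
    """Rebuild the full sequence from its factors."""
    return "".join(pos if length == 0 else base[pos:pos + length]
                   for pos, length in factors)


def factor_starts(factors):
    """Offset in the decoded sequence at which each factor begins."""
    lens = [max(length, 1) for _, length in factors]
    return [0] + list(accumulate(lens))[:-1]


def rlz_access(base, factors, starts, i, count):
    """Return decoded[i:i+count] without decoding the whole sequence."""
    k = bisect_right(starts, i) - 1       # factor containing offset i
    out = []
    while count > 0 and k < len(factors):
        pos, length = factors[k]
        off = i - starts[k]
        if length == 0:
            chunk = pos                   # literal factor holds the symbol
        else:
            chunk = base[pos + off:pos + min(length, off + count)]
        out.append(chunk)
        i += len(chunk)
        count -= len(chunk)
        k += 1
    return "".join(out)


base = "ACGTACGTTTGA"
seq = "ACGTACCTTTGA"                      # one point mutation vs. base
factors = rlz_factorize(base, seq)
starts = factor_starts(factors)
```

In this toy input a single point mutation splits the sequence into three factors, which matches the intuition behind the space bound: the number of factors grows with the number of mutations $s$ rather than with the sequence length.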