Reference sequence construction for relative compression of genomes

  • Authors:
  • Shanika Kuruppu;Simon J. Puglisi;Justin Zobel

  • Affiliations:
  • Department of Computer Science & Software Engineering, University of Melbourne, Australia;Department of Informatics, King's College London, United Kingdom;Department of Computer Science & Software Engineering, University of Melbourne, Australia

  • Venue:
  • SPIRE'11: Proceedings of the 18th International Conference on String Processing and Information Retrieval
  • Year:
  • 2011

Abstract

Relative compression, where a set of similar strings is compressed with respect to a reference string, is an effective method of compressing DNA datasets containing multiple similar sequences. Moreover, it supports rapid random access to the underlying data. The main difficulty of relative compression is selecting an appropriate reference sequence. In this paper, we explore using the dictionaries of repeats generated by the COMRAD, RE-PAIR, and DNA-X algorithms as reference sequences for relative compression. We show that this technique yields better compression and allows more general repetitive datasets to be compressed using relative compression.
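To make the idea of compressing a string relative to a reference concrete, the following is a minimal illustrative sketch, not the authors' implementation. It greedily factorizes a target string into (position, length) copies from a reference, and shows how storing factor start offsets could support random access without full decoding. The function names (relative_factorize, decode, factor_starts, access) and the "lit" marker for symbols absent from the reference are assumptions made for this example. The naive substring search is quadratic; a practical tool would likely use an index over the reference, such as a suffix array.

```python
from bisect import bisect_right
from itertools import accumulate


def relative_factorize(reference, target):
    """Greedily split target into factors copied from reference.

    Each factor is (pos, length), meaning reference[pos:pos + length],
    or ("lit", char) for a symbol that never occurs in reference.
    """
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        length = 1
        # Extend the match one symbol at a time while it still occurs.
        while i + length <= len(target):
            pos = reference.find(target[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            factors.append(("lit", target[i]))
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def decode(reference, factors):
    """Rebuild the target from the reference and its factors."""
    parts = []
    for a, b in factors:
        parts.append(b if a == "lit" else reference[a:a + b])
    return "".join(parts)


def factor_starts(factors):
    """Offset in the target at which each factor begins."""
    lengths = [1 if a == "lit" else b for a, b in factors]
    return [0] + list(accumulate(lengths))[:-1]


def access(reference, factors, starts, i):
    """Return target[i] without decoding the whole target."""
    k = bisect_right(starts, i) - 1  # factor covering offset i
    a, b = factors[k]
    if a == "lit":
        return b
    return reference[a + (i - starts[k])]


if __name__ == "__main__":
    reference = "ACGTACGTGA"  # made-up toy sequences
    target = "ACGTGACGTN"
    factors = relative_factorize(reference, target)
    starts = factor_starts(factors)
    print(factors)
    print(decode(reference, factors) == target)
    print(access(reference, factors, starts, 7))
```

In this sketch the quality of the reference directly determines how few factors are needed, which is why reference selection matters: a reference that shares long substrings with the targets produces a short factor list, while a poor one degenerates into many short copies and literals.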