Efficient direct search on compressed genomic data

Authors:
Xiaohui Xie;Xiaochun Yang;Jiaying Wang;Bin Wang;Chen Li
Affiliations:
Department of Computer Science, University of California, Irvine, CA 92697;College of Information Science and Engineering, Northeastern University, Liaoning 110819 China;College of Information Science and Engineering, Northeastern University, Liaoning 110819 China;College of Information Science and Engineering, Northeastern University, Liaoning 110819 China;Department of Computer Science, University of California, Irvine, CA 92697
Venue:
ICDE '13 Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013)
Year:
2013

Abstract

The explosive growth in the amount of data produced by next-generation sequencing poses significant computational challenges in storing, transmitting, and querying these data efficiently and accurately. A distinctive characteristic of genomic sequence data is that many sequences are highly similar to one another. This has motivated the idea of compressing sequence data by storing only their differences from a reference sequence, which drastically cuts the storage cost. However, an unresolved question in this area is whether search can be performed directly on the compressed data and, if so, how. Here we show that directly querying compressed genomic sequence data is possible and can be done efficiently. We describe a set of novel index structures and algorithms for this purpose, and present several optimization techniques that reduce the space requirement and the query response time. We demonstrate the advantage of our method and compare it against existing ones through a thorough experimental study on real genomic data.
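
To make the idea of reference-based compression concrete, the following is a minimal sketch in Python. It is not the index structure or algorithm proposed in the paper, which the abstract does not describe. It assumes substitutions only (sequences of equal length, no insertions or deletions), and the function names `encode`, `decode`, and `find_in_compressed` are hypothetical. The search shown is a naive baseline that reads bases through the stored differences instead of rebuilding the full sequence.

```python
def encode(reference: str, sequence: str) -> list[tuple[int, str]]:
    """Store a sequence as (position, base) pairs where it differs from the reference.

    Assumption for this sketch: substitutions only, so both strings have equal length.
    """
    if len(reference) != len(sequence):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sequence)) if a != b]


def decode(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Rebuild the original sequence from the reference and its stored differences."""
    bases = list(reference)
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)


def find_in_compressed(reference: str, diffs: list[tuple[int, str]], pattern: str) -> list[int]:
    """Naive exact search that never materializes the decoded sequence.

    Each base is looked up in the differences first and falls back to the reference.
    """
    changed = dict(diffs)
    m = len(pattern)
    hits = []
    for start in range(len(reference) - m + 1):
        if all(changed.get(start + j, reference[start + j]) == pattern[j] for j in range(m)):
            hits.append(start)
    return hits


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diffs = encode(reference, sample)          # only two positions differ
restored = decode(reference, diffs)
hits = find_in_compressed(reference, diffs, "TTC")
```

This baseline scans every window, so its cost grows with the reference length times the pattern length. It only illustrates what searching the compressed form means, not how to do it efficiently.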