Prefix tree indexing for similarity search and similarity joins on genomic data

  • Authors:
  • Astrid Rheinländer;Martin Knobloch;Nicky Hochmuth;Ulf Leser

  • Affiliations:
  • Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany;Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany;Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany;Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany

  • Venue:
  • SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. Especially DNA sequencing produces large collections of erroneous strings which need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in Life Science projects. We present PETER, a prefix tree based indexing algorithm supporting approximate search and approimate joins. Our tool supports Hamming and edit distance as similarity measure and is available as C++ library, as Unix command line tool, and as cartridge for a commercial database. It combines an efficient implementation of compressed prefix trees with advanced pre-filtering techniques that exclude many candidate strings early. The achieved speed-ups are dramatic, especially for DNA with its small alphabet. We evaluate our tool on several collections of long strings containing up to 5,000,000 entries of length up to 3,500. We compare its performance to agrep, nrgrep, and user-defined functions inside a relational database. Our experiments reveal that PETER is faster by orders of magnitudes compared to the command-line tools. Compared to RDBMS, it computes similarity joins in minutes for which UDFs did not finish within a day and outperforms the built-in join methods even in the exact case.