Prefix tree indexing for similarity search and similarity joins on genomic data

Authors:
Astrid Rheinländer;Martin Knobloch;Nicky Hochmuth;Ulf Leser
Affiliations:
Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany;Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany;Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany;Humboldt-Universität zu Berlin, Department of Computer Science, Berlin, Germany
Venue:
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Year:
2010

Abstract

Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, and comparison of biological sequences. DNA sequencing in particular produces large collections of erroneous strings that need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in life science projects. We present PETER, a prefix-tree-based indexing algorithm supporting approximate search and approximate joins. Our tool supports Hamming and edit distance as similarity measures and is available as a C++ library, as a Unix command-line tool, and as a cartridge for a commercial database. It combines an efficient implementation of compressed prefix trees with advanced pre-filtering techniques that exclude many candidate strings early. The achieved speed-ups are dramatic, especially for DNA with its small alphabet. We evaluate our tool on several collections of long strings containing up to 5,000,000 entries of length up to 3,500. We compare its performance to agrep, nrgrep, and user-defined functions inside a relational database. Our experiments reveal that PETER is faster than the command-line tools by orders of magnitude. Compared to RDBMS, it computes in minutes similarity joins for which UDFs did not finish within a day, and it outperforms the built-in join methods even in the exact case.
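
To illustrate the general idea of approximate search on a prefix tree, the following is a minimal sketch of a Hamming-distance search that abandons a branch as soon as the mismatch count exceeds the threshold. It is not PETER's implementation: it uses a plain, uncompressed trie and none of the pre-filtering techniques mentioned in the abstract, and the names `TrieNode`, `build_trie`, and `hamming_search` are hypothetical.

```python
class TrieNode:
    """Node of a plain (uncompressed) prefix tree; illustrative only."""

    def __init__(self):
        self.children = {}     # maps a character to the child node
        self.terminal = False  # True if an indexed string ends here


def build_trie(strings):
    """Insert every string into a prefix tree and return the root."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def hamming_search(root, query, k):
    """Return (string, distance) pairs with Hamming distance <= k to query.

    Hamming distance is defined only for strings of equal length, so only
    indexed strings of exactly len(query) characters can match.
    """
    results = []
    # Each stack entry: (node, depth in tree, mismatches so far, prefix)
    stack = [(root, 0, 0, "")]
    while stack:
        node, depth, mismatches, prefix = stack.pop()
        if depth == len(query):
            if node.terminal:
                results.append((prefix, mismatches))
            continue
        for ch, child in node.children.items():
            cost = mismatches + (ch != query[depth])
            # Prune: every string below this child already exceeds k.
            if cost <= k:
                stack.append((child, depth + 1, cost, prefix + ch))
    return sorted(results)


index = build_trie(["ACGT", "ACGA", "TCGA", "AC", "ACGTT"])
matches = hamming_search(index, "ACGT", 1)
```

Because strings sharing a prefix share a path in the tree, a mismatch budget exhausted on that prefix rules out all of them at once. With a small alphabet such as DNA, many strings share prefixes, so a single pruning decision can discard many candidates.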