Scalable sequence similarity search and join in main memory on multi-cores

Authors:
Astrid Rheinländer;Ulf Leser
Affiliations:
Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany;Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
Venue:
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Year:
2011

Citing 12
Cited 1

PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric

Journal of the ACM (JACM)
Tries for Approximate String Matching

IEEE Transactions on Knowledge and Data Engineering
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Filtration with q-Samples in Approximate String Matching

CPM '96 Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
Evaluating MapReduce for Multi-core and Multiprocessor Systems

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
CloudBurst

Bioinformatics
Fast and accurate long-read alignment with Burrows–Wheeler transform

Bioinformatics
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Prefix tree indexing for similarity search and similarity joins on genomic data

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
An overview of web data clustering practices

EDBT'04 Proceedings of the 2004 international conference on Current Trends in Database Technology

Efficient similarity search in very large string sets

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Similarity-based queries play an important role in many large scale applications. In bioinformatics, DNA sequencing produces huge collections of strings, that need to be compared and merged. We present PeARL, a data structure and algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries which are entirely held in main memory. Parallelization of searches and joins is performed using MapReduce as the underlying execution paradigm. We show that our data structure is capable of performing many real-world applications in sequence comparisons in main memory. Our evaluation reveals that PeARL reaches a significant performance gain compared to single-threaded solutions. However, the evaluation also shows that scalability should be further improved, e.g., by reducing sequential parts of the algorithms.