Reference-based indexing of sequence databases

Authors:
Jayendra Venkateswaran; Deepak Lachwani; Tamer Kahveci; Christopher Jermaine
Affiliations:
CISE Department, University of Florida, Gainesville, FL; CISE Department, University of Florida, Gainesville, FL; CISE Department, University of Florida, Gainesville, FL; CISE Department, University of Florida, Gainesville, FL
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006


Abstract

We consider the problem of similarity search in a very large sequence database with edit distance as the similarity measure. Given limited main memory, our goal is to develop a reference-based index that reduces the number of costly edit distance computations needed to answer a query. The idea in reference-based indexing is to select a small set of reference sequences that serve as surrogates for the other sequences in the database. We consider two novel strategies for selecting references as well as a new strategy for assigning references to database sequences. Our experimental results show that our selection and assignment methods far outperform competing methods. For example, our methods prune up to 20 times as many sequences as the Omni method and up to 30 times as many as frequency vectors. Our methods also scale well to databases containing many and/or very long sequences.
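
To make the general idea of reference-based pruning concrete, here is a minimal sketch in Python. It is not the authors' method: the abstract does not describe their reference selection or assignment strategies, so the sketch takes an arbitrary, caller-supplied reference set and compares every database sequence against all of the references. The function names (`edit_distance`, `build_index`, `range_query`) and the range-query formulation with threshold `eps` are assumptions made for illustration. The pruning relies only on the fact that edit distance is a metric, so the triangle inequality bounds the distance between a query q and a sequence s through any reference r: |d(q,r) - d(s,r)| <= d(q,s) <= d(q,r) + d(s,r).

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_index(database: List[str], references: List[str]) -> List[List[int]]:
    """Precompute the distance from every database sequence to every reference.

    Assumption: references are supplied by the caller (at least one);
    how to choose them well is exactly what the paper studies and is
    not modeled here.
    """
    return [[edit_distance(s, r) for r in references] for s in database]


def range_query(query: str, eps: int, database: List[str],
                references: List[str],
                index: List[List[int]]) -> Tuple[List[str], int]:
    """Return sequences within eps of the query, plus the number of
    query-to-sequence edit distances that had to be computed."""
    q_to_ref = [edit_distance(query, r) for r in references]
    results: List[str] = []
    computed = 0
    for s, s_to_ref in zip(database, index):
        # Triangle-inequality bounds on d(query, s) over all references.
        lower = max(abs(dq - ds) for dq, ds in zip(q_to_ref, s_to_ref))
        upper = min(dq + ds for dq, ds in zip(q_to_ref, s_to_ref))
        if lower > eps:
            continue              # pruned: s cannot be within eps
        if upper <= eps:
            results.append(s)     # accepted without computing d(query, s)
            continue
        computed += 1             # bounds are inconclusive: compute exactly
        if edit_distance(query, s) <= eps:
            results.append(s)
    return results, computed
```

In this sketch, only the sequences whose lower and upper bounds straddle `eps` require an actual edit distance computation; the quality of the reference set determines how often that happens, which is why reference selection matters.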