Indexing schemes for similarity search in datasets of short protein fragments

Authors:
Aleksandar Stojmirović;Vladimir Pestov
Affiliations:
Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Ont., Canada K1N 6N5;Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Ont., Canada K1N 6N5
Venue:
Information Systems
Year:
2007

Abstract

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search is important both as a building block for more complex algorithms and for possible direct use in biological investigations, where datasets are of the order of 60 million objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs exceptionally well: for example, it outputs the 100 nearest neighbours of any possible fragment of length 10 after scanning, on average, less than 1% of the entire dataset.
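
To make the search problem concrete, the following is a minimal sketch of ungapped, score matrix-based nearest-neighbour search done by a full linear scan, that is, the baseline that an indexing scheme of the kind described above aims to avoid. It is not the authors' method. The score table `TOY_SCORES`, the four-letter alphabet, and the function names are all hypothetical and chosen only for illustration; a real application would use a full amino acid score matrix.

```python
import heapq

# Hypothetical toy similarity scores over a 4-letter alphabet. This is NOT a
# real amino acid score matrix; it is symmetric, and higher means more similar.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "L"): -1, ("A", "I"): -1,
    ("G", "G"): 6, ("G", "L"): -4, ("G", "I"): -4,
    ("L", "L"): 4, ("L", "I"): 2,
    ("I", "I"): 4,
}


def pair_score(a, b):
    """Look up the score of two letters, in either order."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def ungapped_score(x, y):
    """Sum per-position scores of two equal-length fragments (no gaps)."""
    if len(x) != len(y):
        raise ValueError("ungapped comparison needs fragments of equal length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


def nearest_neighbours(query, dataset, k):
    """Return the k highest-scoring fragments by scanning the whole dataset."""
    return heapq.nlargest(k, dataset, key=lambda frag: ungapped_score(query, frag))


fragments = ["ALIG", "AILG", "GGGG", "LLLL"]
top_two = nearest_neighbours("ALIG", fragments, k=2)
```

This scan evaluates the score against every fragment, so its cost grows linearly with the dataset size. The abstract's claim is that a hierarchical index can return the same kind of answer while examining only a small fraction of the fragments.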