We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search is important both as a building block for more complex algorithms and for possible direct use in biological investigations, where datasets are of the order of 60 million objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs exceptionally well: for example, it outputs the 100 nearest neighbours of any possible fragment of length 10 after scanning, on average, less than 1% of the entire dataset.
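To make the search problem concrete, the following is a minimal sketch of ungapped, score-based similarity between equal-length fragments, together with the exhaustive linear scan that an indexing scheme aims to avoid. It is not the hierarchical scheme proposed here. The names `toy_score`, `ungapped_score`, and `top_k_linear_scan` are hypothetical, and the scoring function is a toy stand-in assumed for illustration, not a real substitution matrix.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

# Hypothetical toy scoring function (an assumption for illustration only).
# A real application would look up residue pairs in a substitution matrix.
def toy_score(a: str, b: str) -> int:
    return 4 if a == b else -1

def ungapped_score(x: str, y: str,
                   score: Callable[[str, str], int] = toy_score) -> int:
    """Sum position-wise scores; no gaps, so lengths must match."""
    if len(x) != len(y):
        raise ValueError("ungapped comparison needs equal-length fragments")
    return sum(score(a, b) for a, b in zip(x, y))

def top_k_linear_scan(query: str, dataset: Iterable[str], k: int,
                      score: Callable[[str, str], int] = toy_score
                      ) -> List[Tuple[int, str]]:
    """Baseline: score every fragment and keep the k highest-scoring ones."""
    scored = ((ungapped_score(query, frag, score), frag) for frag in dataset)
    return heapq.nlargest(k, scored)

# Tiny illustrative dataset of length-5 fragments (made up for this sketch).
fragments = ["ACDEF", "ACDEG", "WYWYW", "ACKKF"]
neighbours = top_k_linear_scan("ACDEF", fragments, k=2)
print(neighbours)
```

This baseline touches every fragment for every query. An index is useful to the extent that it returns the same neighbours while scanning only a small fraction of the dataset.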