Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Finding motifs using random projections
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
FLASH: A Fast Look-Up Algorithm for String Homology
Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology
Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Search algorithms for biosequences using random projection
Search algorithms for biosequences using random projection
Designing seeds for similarity search in genomic DNA
RECOMB '03 Proceedings of the seventh annual international conference on Research in computational molecular biology
Locality-sensitive hashing scheme based on p-stable distributions
SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Efficient algorithms for substring near neighbor problem
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
Optimal spaced seeds for faster approximate string matching
ICALP'05 Proceedings of the 32nd international conference on Automata, Languages and Programming
Trying to outperform a well-known index with a sequential scan
Proceedings of the Joint EDBT/ICDT 2013 Workshops
LSH-based large scale chinese calligraphic character recognition
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Hi-index | 0.00 |
The field of algorithms for pairwise biosequence similarity search is dominated by heuristic methods of high efficiency but uncertain sensitivity. One reason that more formal string matching algorithms with sensitivity guarantees have not been applied to biosequences is that they cannot directly find similarities that score highly under substitution score functions such as the DNAPAM-TT [20], PAM [9], or BLOSUM [12] families of matrices. We describe a general technique, score simulation, to map ungapped similarity search problems using these score functions into the problem of finding pairs of strings that are close in Hamming space. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity searches with formal guarantees of sensitivity using arbitrary score functions. In particular, we introduce the lsh-all-pairs-sim algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.