Simple and Practical Sequence Nearest Neighbors with Block Operations

Authors:
S. Muthukrishnan;Süleyman Cenk Sahinalp
Affiliations:
-;-
Venue:
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Year:
2002

Citing 10
Cited 7

Block edit models for approximate string matching

Theoretical Computer Science - Special issue: Latin American theoretical informatics
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Efficient search for approximate nearest neighbor in high dimensional spaces

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Sorting by Transpositions

SIAM Journal on Discrete Mathematics
Approximate nearest neighbors and sequence comparison with block operations

STOC '00 Proceedings of the thirty-second annual ACM symposium on Theory of computing
Communication complexity of document exchange

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
A new approach to sequence comparison: normalized sequence alignment

RECOMB '01 Proceedings of the fifth annual international conference on Computational biology
Permutation Editing and Matching via Embeddings

ICALP '01 Proceedings of the 28th International Colloquium on Automata, Languages and Programming
Edit Distance with Move Operations

CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Efficient approximate and dynamic matching of patterns using a labeling paradigm

FOCS '96 Proceedings of the 37th Annual Symposium on Foundations of Computer Science

Computing Highly Specific and Mismatch Tolerant Oligomers Efficiently

CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Efficient algorithms for substring near neighbor problem

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform

Proceedings of the thirty-eighth annual ACM symposium on Theory of computing
Edit distance with move operations

Journal of Discrete Algorithms
Gestures are strings: efficient online gesture spotting and classification using string matching

Proceedings of the ICST 2nd international conference on Body area networks
Faster dimension reduction

Communications of the ACM
An efficient algorithm for finding similar short substrings from large scale string data

PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining

Quantified Score

Hi-index	0.02

Abstract

The sequence nearest neighbors problem can be defined as follows. Given a database D of n sequences, preprocess D so that given any query sequence Q, one can quickly find a sequence S in D for which d(S,Q) ≤ d(T,Q) for any other sequence T in D. Here d(S,Q) denotes the "distance" between sequences S and Q, which can be defined as the minimum number of "edit operations" needed to transform one sequence into the other. The edit operations considered in this paper include single character edits (insertions, deletions, replacements) as well as block (substring) edits (copying, uncopying and relocating blocks).

One of the main application domains for the sequence nearest neighbors problem is computational genomics, where available tools for sequence comparison and search usually focus on edit operations involving single characters only. While such tools are useful for capturing certain evolutionary mechanisms (mainly point mutations), they may have limited applicability for understanding the mechanisms of segmental rearrangement (duplications, translocations and deletions) underlying genome evolution. Recent progress in resolving the composition of the human genome suggests that such segmental rearrangements are much more common than previously estimated. Thus there is a substantial need for incorporating similarity measures that capture block edit operations in genomic sequence comparison and search. Unfortunately, even the computation of a block edit distance between two sequences under any set of non-trivial edit operations is NP-hard.

The first efficient data structure for approximate sequence nearest neighbor search for any set of non-trivial edit operations was described in [11]; the measure considered in this paper is the block edit distance. This method achieves preprocessing time and space polynomial in the size of D and query time near-linear in the size of Q by allowing an approximation factor of O(log l (log* l)^2).
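As a point of reference for the problem statement above, the following minimal Python sketch spells out the exact nearest neighbor query as a linear scan over D. It is an illustration only and not the paper's data structure: because block edit distance is NP-hard to compute, the sketch substitutes the character-level (Levenshtein) distance as a stand-in for d, and the names `char_edit_distance` and `nearest_neighbor` are hypothetical.

```python
from typing import Callable, Sequence


def char_edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: single-character insertions, deletions and
    replacements only. This is a stand-in for d; it does NOT model block
    operations (copying, uncopying, relocating)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a from s
                curr[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),   # replace a by b, or match
            ))
        prev = curr
    return prev[-1]


def nearest_neighbor(
    database: Sequence[str],
    query: str,
    dist: Callable[[str, str], int] = char_edit_distance,
) -> str:
    """Return S in the database with dist(S, query) <= dist(T, query)
    for every other T. No preprocessing: one distance call per sequence."""
    return min(database, key=lambda s: dist(s, query))


if __name__ == "__main__":
    D = ["ACGTACGT", "TTTTTTTT", "ACGTTCGT"]
    print(nearest_neighbor(D, "ACGTTCGA"))
```

The scan costs one distance computation per database sequence at query time, which is the cost that preprocessing D is meant to avoid.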
The approach involves embedding sequences into Hamming space so that approximating Hamming distances estimates sequence block edit distances within the approximation ratio above.

In this study we focus on simplification and experimental evaluation of the method of [11]. We first describe how we implement the transformations provided in [11] and test their accuracy in estimating the block edit distance on controlled data sets. Then, based on the Hamming distance estimator described in [3], we present a data structure for computing approximate nearest neighbors in Hamming space; this is simpler than the well-known ones in [9,6]. We finally report on how well the combined data structure performs for sequence nearest neighbor search under block edit distance.
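To give a feel for what estimating Hamming distances can look like once sequences have been embedded as bit vectors, here is a small Python sketch based on plain random coordinate sampling. It is an assumption made for illustration: the abstract does not spell out the estimator of [3] or the paper's data structure, and the function names, the sample size `k` and the fixed seed are all hypothetical.

```python
import random
from typing import List, Sequence


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Exact Hamming distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def sample_positions(dim: int, k: int, seed: int = 0) -> List[int]:
    """Pick k coordinates uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(dim) for _ in range(k)]


def sketch(v: Sequence[int], positions: Sequence[int]) -> List[int]:
    """Project v onto the sampled coordinates."""
    return [v[p] for p in positions]


def estimate_hamming(sk_u: Sequence[int], sk_v: Sequence[int], dim: int) -> float:
    """Scale the mismatch rate on the sample up to the full dimension.
    Each sampled coordinate mismatches with probability h/dim, so the
    estimate is unbiased for the true distance h."""
    return dim * hamming(sk_u, sk_v) / len(sk_u)


def approx_nearest(sketches: Sequence[Sequence[int]], query_sketch: Sequence[int]) -> int:
    """Index of the stored sketch closest to the query sketch (linear scan)."""
    return min(range(len(sketches)), key=lambda i: hamming(sketches[i], query_sketch))


if __name__ == "__main__":
    dim, k = 1024, 64
    positions = sample_positions(dim, k)
    rng = random.Random(1)
    database = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(5)]
    query = list(database[3])
    query[0] ^= 1  # perturb one coordinate of a stored vector
    sketches = [sketch(v, positions) for v in database]
    best = approx_nearest(sketches, sketch(query, positions))
    print(best, estimate_hamming(sketches[best], sketch(query, positions), dim))
```

Comparing k sampled coordinates instead of all of them trades accuracy for speed; how large k must be for a given error bound is not addressed by this sketch.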