An efficient algorithm for finding similar short substrings from large scale string data

Authors:
Takeaki Uno
Affiliations:
National Institute of Informatics, Tokyo, Japan
Venue:
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2008

Abstract

Finding similar substrings or substructures is a central task in analyzing huge amounts of string data such as genome sequences, web documents, and log data. From the viewpoint of complexity theory, the existence of polynomial-time algorithms for such problems is usually trivial, since the number of substrings is bounded by the square of the string length. However, straightforward algorithms do not work on huge practical databases because their running time is a high-degree polynomial. This paper addresses the problem of finding pairs of strings with small Hamming distance in huge databases composed of short strings. By solving the problem for all substrings of a fixed length, we can efficiently find candidates for longer similar substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the database size. We prove that the computation time of a variant of the algorithm is linear in the database size when the length of the short strings to be found is constant. With slight modifications, the algorithm can be adapted to edit distance and to mismatch tolerance computation. Computational experiments on genome sequences show the efficiency of the algorithm. An implementation is available at the author's homepage.
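
The abstract states the problem but not the method, so the following Python sketch only illustrates the problem setting and should not be read as the paper's algorithm. It assumes a standard pigeonhole filter: if two strings of equal length differ in at most d positions and each is cut into d + 1 blocks, at least one block must be identical in both. Candidate pairs are therefore gathered by bucketing on each block, then verified by a direct Hamming distance computation. The names `hamming`, `block_bounds`, `similar_pairs`, and `windows` are hypothetical helpers introduced for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def block_bounds(length: int, blocks: int) -> list[tuple[int, int]]:
    """Split range(length) into `blocks` contiguous, near-equal intervals."""
    return [(i * length // blocks, (i + 1) * length // blocks)
            for i in range(blocks)]


def similar_pairs(strings: list[str], d: int) -> set[tuple[int, int]]:
    """Return all index pairs (i, j), i < j, with Hamming distance <= d.

    Illustrative pigeonhole filter (an assumption, not the paper's method):
    strings within distance d must agree exactly on at least one of d + 1
    blocks, so only strings sharing a block are compared.
    """
    if not strings:
        return set()
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")

    found: set[tuple[int, int]] = set()
    for lo, hi in block_bounds(length, d + 1):
        # Group string indices by the content of the current block.
        buckets: dict[str, list[int]] = defaultdict(list)
        for idx, s in enumerate(strings):
            buckets[s[lo:hi]].append(idx)
        # Verify every pair that shares this block.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if (i, j) not in found and hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found


def windows(text: str, length: int) -> list[str]:
    """All substrings of `text` with the given fixed length."""
    return [text[p:p + length] for p in range(len(text) - length + 1)]
```

As a usage example, `similar_pairs(windows(genome, 20), 2)` would list position pairs in a hypothetical string `genome` whose length-20 substrings differ in at most two characters. Buckets can still grow large on repetitive data, so this sketch makes no claim to the near-linear running time reported in the abstract.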