Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Authors:
Takeaki Uno
Affiliations:
National Institute of Informatics, 2-1-2, Hitotsubashi, Chiyoda-ku, 101-8430, Tokyo, Japan
Venue:
Knowledge and Information Systems - Special Issue: Best Papers from the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2008); Guest Editors: Takashi Washio, Einoshin Suzuki and Kai Ming Ting
Year:
2010

Abstract

Finding similar substrings or substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, and feature vectors of pictures, photos, and videos. Polynomial-time algorithms for such problems trivially exist, since the number of substrings of a string is bounded by the square of its length, but straightforward algorithms do not work for huge databases because of the high polynomial order of their computation time. This paper addresses the problem of finding pairs of strings with small Hamming distances in huge databases composed of short strings of a fixed length. Long strings can be compared by inputting all their substrings of the fixed length, which yields candidates for similar longer substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of a variant of the algorithm is linear in the database size when the length of the short strings is constant, and computational experiments on genome sequences and Web texts show its practical efficiency. With slight modifications, the algorithm adapts to edit distance and to mismatch-tolerance computation. An implementation is available at the author’s homepage.
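
The abstract does not describe how the algorithm works internally, so the following is only a minimal sketch of one way to find all pairs of fixed-length strings within Hamming distance d without comparing every pair. It assumes a block-partition (pigeonhole) scheme: each string is cut into k > d blocks, and two strings that differ in at most d positions must agree exactly on at least k - d blocks. Grouping the strings by each choice of k - d blocks therefore brings every qualifying pair into a common group at least once. The function names, the default choice of k, and the use of hash buckets in place of sorting are illustrative assumptions, not the author's implementation.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(strings, d, k=None):
    """Return the set of index pairs (i, j), i < j, whose strings have
    Hamming distance <= d. All strings must share one length.

    Hypothetical helper: k is the number of blocks (assumed default d + 2).
    """
    if not strings:
        return set()
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")
    if not 0 <= d < length:
        raise ValueError("d must satisfy 0 <= d < string length")
    if k is None:
        k = min(length, d + 2)
    if not d < k <= length:
        raise ValueError("k must satisfy d < k <= string length")

    # Block b covers positions bounds[b] .. bounds[b + 1] - 1.
    bounds = [length * b // k for b in range(k + 1)]

    result = set()
    # Try every way of choosing the k - d blocks that must match exactly.
    for chosen in combinations(range(k), k - d):
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b]:bounds[b + 1]] for b in chosen)
            buckets[key].append(idx)
        # Only strings in the same bucket are candidates; verify each pair.
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if hamming(strings[i], strings[j]) <= d:
                    result.add((i, j))
    return result


if __name__ == "__main__":
    data = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGA", "TTTTTTTA"]
    print(sorted(similar_pairs(data, 1)))
```

A larger k makes the matching blocks shorter and the number of block combinations larger, so in this sketch k trades the number of grouping passes against the size of the candidate groups. Following the abstract, long strings could be handled by feeding all of their fixed-length substrings to such a routine and treating the reported pairs as candidates for longer similar regions.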