An Algorithm for Finding a Common Structure Shared by a Family of Strings

Authors:
A. M. Landraud; J. F. Avril; P. Chretienne
Affiliations:
Univ. Pierre et Marie Curie, Paris, France; France; Univ. Pierre et Marie Curie, Paris, France
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
1989

Abstract

An algorithm is presented for extracting and localizing a common structure in a family of strings, with time complexity $O(N^2 L^2 \log_2 L)$, where $N$ is the number of strings and $L$ is their maximum length. This structure appears as alignments of words that are similar but not necessarily identical and that occur at approximately the same location in all the strings. The method works in two successive stages. First, a fast algorithm draws up a directory of exactly repeated patterns appearing in a given majority of the strings. Second, the algorithm recursively constructs anchoring patterns by a divide-and-conquer strategy and converges on a maximum number of alignments. The algorithm has been applied to find common features, unknown a priori, in families of biological macromolecules, with quite good results. One of these families included 23 strings of about 100 characters each. Each characteristic structure was obtained in less than one minute on a MULTIX-DPS8 system. The method could be extended to two-dimensional image analysis.
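
To make the first stage concrete, the following is a minimal sketch of what a directory of exactly repeated patterns might look like. It is a naive scan over fixed-length substrings, not the fast algorithm the paper refers to, and it does not attempt the second (anchoring) stage. The function name `shared_patterns`, the pattern length `k`, and the reading of "a given majority" as a strict fraction threshold `min_fraction` are all assumptions made for illustration.

```python
from collections import defaultdict


def shared_patterns(strings, k, min_fraction=0.5):
    """Naive directory of exact length-k patterns (illustrative only).

    Returns {pattern: {string_index: [positions]}} for every length-k
    substring that occurs in strictly more than `min_fraction` of the
    input strings. The threshold rule is an assumption, not the paper's.
    """
    occurrences = defaultdict(lambda: defaultdict(list))
    for idx, s in enumerate(strings):
        # Record every starting position of every length-k window.
        for pos in range(len(s) - k + 1):
            occurrences[s[pos:pos + k]][idx].append(pos)

    threshold = min_fraction * len(strings)
    # Keep only patterns found in enough distinct strings.
    return {
        pattern: dict(hits)
        for pattern, hits in occurrences.items()
        if len(hits) > threshold
    }


# Hypothetical toy family: three strings share "ACGT", one does not.
family = ["ACGTTGCA", "TTACGTAA", "GGACGTCC", "CCCCCCCC"]
directory = shared_patterns(family, k=4)
print(directory)  # {'ACGT': {0: [0], 1: [2], 2: [2]}}
```

The recorded positions are what a later stage would need in order to check that the occurrences lie at approximately the same location in each string.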