An efficient algorithm for identifying the most contributory substring

Authors: Ben Stephenson
Affiliations: Department of Computer Science, University of Western Ontario, London, Ontario, Canada
Venue: DaWaK'07 Proceedings of the 9th international conference on Data Warehousing and Knowledge Discovery
Year: 2007

Citing 8
Cited 0

A note on the height of suffix trees
SIAM Journal on Computing

Self-alignments in words and their applications
Journal of Algorithms

Autocorrelation on words and its applications: analysis of suffix trees by string-ruler approach
Journal of Combinatorial Theory Series A

Algorithms on strings, trees, and sequences: computer science and computational biology

A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM (JACM)

Constructing Suffix Trees On-Line in Linear Time
Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture - Information Processing '92, Volume 1

Optimal suffix tree construction with large alphabets
FOCS '97 Proceedings of the 38th Annual Symposium on Foundations of Computer Science

Linear pattern matching algorithms
SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory

Abstract

Detecting repeated portions of strings has important applications to many areas of study including data compression and computational biology. This paper defines and presents a solution for the Most Contributory Substring Problem, which identifies the single substring that represents the largest proportion of the characters within a set of strings. We show that a solution to the problem can be achieved with an O(n) running time (where n is the total number of characters in all of the input strings) when overlapping occurrences of the most contributory substring are permitted. Furthermore, we present an extended algorithm that does not permit occurrences of the most contributory substring to overlap. The expected running time of the extended algorithm is O(n log n) while its worst case performance is O(n^2).
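
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the linear-time algorithm the paper presents. It assumes that a substring's contribution is its length multiplied by its total number of occurrences across all input strings, which is one reading of "the largest proportion of the characters". The function names, the `allow_overlap` flag, and the tie-breaking rule (lexicographically smallest candidate wins) are all illustrative assumptions, not taken from the paper.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing occurrences to overlap."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)  # advance one position so overlaps are found
    return count


def count_non_overlapping(text, sub):
    """Count occurrences of sub in text that do not overlap.

    str.count scans left to right and skips past each match, which yields
    the maximum number of non-overlapping occurrences of a single pattern.
    """
    return text.count(sub)


def most_contributory_substring(strings, allow_overlap=True):
    """Brute-force search over every substring of every input string.

    ASSUMPTION: contribution = len(substring) * total occurrence count.
    Ties are broken in favor of the lexicographically smallest substring.
    """
    counter = count_overlapping if allow_overlap else count_non_overlapping
    candidates = {
        s[i:j]
        for s in strings
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
    }
    best, best_score = "", 0
    for sub in sorted(candidates):
        score = len(sub) * sum(counter(s, sub) for s in strings)
        if score > best_score:  # strict comparison keeps the earliest candidate on ties
            best, best_score = sub, score
    return best, best_score
```

Under these assumptions, the two variants can disagree on the same input. For the single string "aaaa", the overlapping variant finds three occurrences of "aa", while the non-overlapping variant finds only two, so the best-scoring substring changes. The sketch enumerates every candidate substring and is only practical for short inputs.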