Discovering subword associations in strings in time linear in the output size

Authors:
Alberto Apostolico;Giorgio Satta
Affiliations:
College of Computing, Georgia Institute of Technology, 801 Atlantic Dr., Atlanta 30332 GA, USA and Department of Information Engineering, University of Padua, via Gradenigo, 6/A, I-35131 Padova, I ...;Department of Information Engineering, University of Padua, via Gradenigo, 6/A, I-35131 Padova, Italy
Venue:
Journal of Discrete Algorithms
Year:
2009

Abstract

Given a text string x of n symbols and an integer constant d, we consider the problem of finding, for any pair (y,z) of subwords of x, the tandem index associated with the pair, defined as the number of times that y and z occur in tandem (i.e., with no intermediate occurrence of either one of them) within a distance of d symbols of x. Although in principle there might be O(n^4) distinct subword pairs in x, it suffices to consider a family of only O(n^2) such pairs, with the property that for any neglected pair (y',z') there exists a corresponding pair (y,z) in the family such that: (i) y' is a prefix of y and z' is a prefix of z; and (ii) the tandem index of (y',z') equals that of (y,z). The main contribution of the paper is an algorithm showing that all non-zero tandem indices for a string can be computed optimally, in time and space linear in the size of the output.
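
To make the definition concrete, the following is a naive brute-force sketch of the tandem index for a single pair (y,z). It is not the algorithm of the paper, which works on all pairs at once and runs in time linear in the output size. The abstract does not fix every detail of the definition, so the sketch assumes one plausible reading: an occurrence of y starting at position i and an occurrence of z starting at position j are counted when i < j <= i + d and no occurrence of y or z starts strictly between i and j. The function names `occurrences` and `tandem_index` are hypothetical, and y and z are assumed to be non-empty.

```python
def occurrences(x, w):
    """Return all start positions of w in x (overlapping occurrences allowed)."""
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def tandem_index(x, y, z, d):
    """Brute-force tandem index of (y, z) in x under the assumed reading:
    count pairs (i, j) with y at i, z at j, i < j <= i + d, and no
    occurrence of y or z starting strictly between i and j.
    """
    y_pos = occurrences(x, y)
    z_pos = occurrences(x, z)
    starts = set(y_pos) | set(z_pos)  # start positions of either word
    count = 0
    for i in y_pos:
        for j in z_pos:
            within_distance = i < j <= i + d
            if within_distance and not any(i < k < j for k in starts):
                count += 1
    return count


if __name__ == "__main__":
    print(tandem_index("abcab", "ab", "c", 3))
    print(tandem_index("abcab", "a", "c", 3))
```

In the string "abcab", every occurrence of "a" is followed by "b", so under this reading the pair ("a","c") has the same tandem index as ("ab","c"). This is the kind of redundancy that property (ii) exploits: the shorter pair can be neglected because the longer one carries the same value.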