Maximal words in sequence comparisons based on subword composition

Authors: Alberto Apostolico
Affiliations: Georgia Institute of Technology & Università di Padova
Venue: Algorithms and Applications
Year: 2010



Abstract

Measures of sequence similarity and distance based, more or less explicitly, on subword composition are attracting increasing interest, driven by data-intensive applications such as massive document classification and genome-wide molecular taxonomy. A unifying trait of such measures is an underlying notion of relative compressibility, whereby two similar sequences are expected to share more common substrings than two distant ones. This paper reviews some of the approaches to sequence comparison based on subword composition and suggests that their common denominator may ultimately reside in special classes of subwords, whose nature resonates in interesting ways with the structure of popular subword trees and graphs.
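As a rough illustration of what a composition-based measure can look like, the sketch below compares two strings through their q-gram (length-q substring) count profiles. It is a minimal example under stated assumptions, not necessarily one of the measures the paper reviews: the function names `qgram_profile`, `qgram_distance`, and `shared_qgrams` are hypothetical, and the L1 distance between profiles is just one common choice.

```python
from collections import Counter


def qgram_profile(s: str, q: int) -> Counter:
    """Count every length-q substring (q-gram) of s, including repeats."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int) -> int:
    """L1 distance between the q-gram count profiles of x and y."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    # A Counter returns 0 for q-grams it has never seen.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def shared_qgrams(x: str, y: str, q: int) -> int:
    """Number of distinct q-grams that occur in both x and y."""
    return len(qgram_profile(x, q).keys() & qgram_profile(y, q).keys())


if __name__ == "__main__":
    # Similar strings share q-grams; dissimilar ones share few or none.
    print(qgram_distance("abab", "baba", 2), shared_qgrams("abab", "baba", 2))
    print(qgram_distance("aaaa", "bbbb", 2), shared_qgrams("aaaa", "bbbb", 2))
```

Fixing a single q is a simplification made here for brevity; measures that look at subwords of all lengths would typically rely on an index such as a suffix tree rather than explicit enumeration.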