Generalised Sequence Signatures through symbolic clustering

Authors:
Dietmar H. Dorr; Anne M. Denton
Affiliations:
Research and Development, Thomson Reuters, St. Paul, MN 55123, USA; Department of Computer Science, North Dakota State University, Fargo, ND 58102, USA
Venue:
International Journal of Data Mining and Bioinformatics
Year:
2010

Abstract

Traditionally, sequence motifs and domains are defined such that insertions, deletions and mismatched regions are small compared with matched regions. We introduce an algorithm for identifying Generalised Sequence Signatures (GSS), which can be composed of windows distributed throughout the sequence. Our approach is based on cluster analysis of recurring subsequences of a predefined length, which we refer to as symbols. Sequences are grouped so as to maximise the number of symbols they share. We show that using GSS to derive sequence annotations yields higher confidence values than other signature recognition approaches.
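
The idea of symbols and symbol-based grouping can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the clustering procedure, so the example substitutes a simple single-linkage grouping over shared symbol counts. The function names (`extract_symbols`, `group_sequences`, `group_signature`) and the parameters `k`, `min_support` and `min_shared` are assumptions introduced for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def extract_symbols(sequences, k, min_support=2):
    """Map each recurring length-k window ("symbol") to the sequences containing it."""
    occurrences = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            occurrences[seq[start:start + k]].add(idx)
    # Assumption: a window counts as recurring if it appears in at least
    # `min_support` different sequences.
    return {sym: ids for sym, ids in occurrences.items() if len(ids) >= min_support}


def group_sequences(sequences, k, min_shared=2):
    """Link sequences that share at least `min_shared` symbols (single linkage).

    This is a stand-in for the clustering step, not the published method.
    """
    symbols = extract_symbols(sequences, k)

    # Count shared symbols for every pair of sequences.
    shared = defaultdict(int)
    for ids in symbols.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    # Union-find over sequence indices.
    parent = list(range(len(sequences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in range(len(sequences)):
        groups[find(i)].append(i)
    return sorted(groups.values()), symbols


def group_signature(symbols, group):
    """Symbols present in every member of the group, regardless of position."""
    members = set(group)
    return sorted(sym for sym, ids in symbols.items() if members <= ids)


# Toy data: the windows "ABCD" and "EFGH" occur in three sequences,
# at different positions and in different orders.
toy = ["ABCDXXXEFGH", "EFGHYYABCD", "QRSTUVWXYZ", "ZZABCDZEFGH"]
groups, symbols = group_sequences(toy, k=4)
signature = group_signature(symbols, groups[0])
```

In this toy input, the shared windows are separated by unrelated regions of varying length and appear in a different order in the second sequence, which is the kind of signature that a contiguous motif model would not capture.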