Clustering sequences by overlap
International Journal of Data Mining and Bioinformatics
Traditionally, sequence motifs and domains are defined such that insertions, deletions and mismatched regions are small compared with matched regions. We introduce an algorithm for identifying Generalised Sequence Signatures (GSS), which can be composed of windows distributed throughout the sequence. Our approach is based on cluster analysis of recurring subsequences of a predefined length, which we refer to as symbols. Sequences are grouped so as to maximise the number of symbols they share. We show that using GSS to derive sequence annotations yields higher confidence values than other signature recognition approaches.
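To make the idea of grouping sequences by shared symbols concrete, the following is a minimal illustrative sketch, not the algorithm of the paper. It assumes that a symbol is any window of length `k`, that a symbol counts as recurring when it occurs in at least `min_seqs` sequences, and that groups are formed by a simple greedy merge of the two groups with the largest common symbol set. The function names and the `k`, `min_seqs` and `min_shared` parameters are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def symbols(seq, k):
    """Return the set of length-k windows (the 'symbols') found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recurring_symbols(seqs, k, min_seqs=2):
    """Keep, for each sequence, only symbols seen in at least min_seqs sequences."""
    per_seq = {name: symbols(s, k) for name, s in seqs.items()}
    counts = {}
    for syms in per_seq.values():
        for sym in syms:
            counts[sym] = counts.get(sym, 0) + 1
    return {name: {s for s in syms if counts[s] >= min_seqs}
            for name, syms in per_seq.items()}


def cluster_by_shared_symbols(seqs, k, min_shared=2):
    """Greedy grouping (an assumed stand-in for the paper's clustering step).

    Each group is a pair (member names, symbols common to all members).
    The pair of groups with the most symbols in common is merged repeatedly
    until no pair shares at least min_shared symbols.
    """
    clusters = [([name], syms)
                for name, syms in recurring_symbols(seqs, k).items()]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            shared = clusters[i][1] & clusters[j][1]
            if len(shared) >= min_shared and (
                    best is None or len(shared) > len(best[2])):
                best = (i, j, shared)
        if best is None:
            break  # no remaining pair shares enough symbols
        i, j, shared = best
        merged = (clusters[i][0] + clusters[j][0], shared)
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters
```

In this sketch the symbol set retained for each group plays the role of a signature: its windows may lie anywhere in the member sequences, with no requirement that they be adjacent or in the same order.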