An inexact-suffix-tree-based algorithm for detecting extensible patterns

Authors:
Abhijit Chattaraj; Laxmi Parida
Affiliations:
School of Computer Science & Information Technology, RMIT University, Melbourne, Australia; Computational Biology Center, IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
Theoretical Computer Science - Pattern discovery in the post genome
Year:
2005

References (6)

Introduction to algorithms
Combinatorial pattern discovery for scientific data: some preliminary results. SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology
Motif discovery without alignment or enumeration (extended abstract). RECOMB '98 Proceedings of the second annual international conference on Computational molecular biology
Compact recognizers of episode sequences. Information and Computation
A Double Combinatorial Approach to Discovering Patterns in Biological Sequences. CPM '96 Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching

Cited by (4)

The subsequence composition of a string. Theoretical Computer Science
VARUN: Discovering Extensible Motifs under Saturation Constraints. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Dotted suffix trees: a structure for approximate text indexing. SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Analysis of the spatial and temporal locality in data accesses. ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part II

Abstract

Given an input sequence of data, a rigid pattern is a repeating sequence, possibly interspersed with don't-care characters. The data could be a sequence of characters, sets of characters, or even real values. In practice, the patterns or motifs of interest are the ones that also allow a variable number of gaps (or don't-care characters): these are patterns with spacers, termed extensible patterns. In a bioinformatics context, similar patterns have also been called flexible patterns or motifs. The extensibility is succinctly defined by a single integer parameter D ≥ 1, interpreted as follows: the space between two successive solid characters in a reported motif may be between 1 and D characters. We introduce a data structure called the inexact-suffix tree and present an algorithm based on this data structure for detecting extensible patterns. The algorithm has been tested primarily on biological data such as DNA and protein sequences. However, the generality of the system makes it equally applicable to other data mining, clustering, and knowledge extraction applications.
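
To make the definitions in the abstract concrete, the following minimal Python sketch locates the occurrences of a given rigid or extensible pattern in a string. It is not the inexact-suffix-tree algorithm of the paper, and it does not discover patterns; it only illustrates what it means for a pattern with spacers to occur under the parameter D. The notation is an assumption made for this example: '.' stands for a single don't-care character, '-' stands for a spacer of between 1 and D don't-care characters, and any other character is solid. The function names are hypothetical.

```python
import re


def extensible_to_regex(pattern: str, d: int) -> re.Pattern:
    """Translate a motif into a compiled regular expression.

    Assumed notation (illustrative, not taken from the paper):
      '.'  one don't-care character (rigid gap of length 1)
      '-'  a spacer of between 1 and d don't-care characters
      any other character is a solid character matched literally
    """
    if d < 1:
        raise ValueError("D must be >= 1")
    parts = []
    for ch in pattern:
        if ch == ".":
            parts.append(".")
        elif ch == "-":
            parts.append(".{1,%d}" % d)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def occurrence_starts(pattern: str, text: str, d: int) -> list:
    """Return every start index at which the motif occurs in text."""
    rx = extensible_to_regex(pattern, d)
    # Try each start position explicitly so overlapping occurrences are kept.
    return [i for i in range(len(text)) if rx.match(text, i)]


if __name__ == "__main__":
    seq = "ACGTTACGGGT"
    print(occurrence_starts("A.G", seq, 1))   # rigid pattern
    print(occurrence_starts("AC-T", seq, 2))  # extensible, D = 2
    print(occurrence_starts("AC-T", seq, 3))  # extensible, D = 3
```

On the sequence ACGTTACGGGT, the motif AC-T occurs only at position 0 when D = 2, because the second AC is followed by three characters before the next T; raising D to 3 admits that second occurrence as well. This brute-force scan is meant only to pin down the semantics of D, whereas the paper's contribution is an index-based method for discovering such patterns.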