Discovery of non-induced patterns from sequences

Authors:
Andrew K. C. Wong;Dennis Zhuang;Gary C. L. Li;En-Shiun Annie Lee
Affiliations:
Department of System Design, University of Waterloo, Waterloo, Ontario, Canada;Department of System Design, University of Waterloo, Waterloo, Ontario, Canada;Department of System Design, University of Waterloo, Waterloo, Ontario, Canada;Department of System Design, University of Waterloo, Waterloo, Ontario, Canada
Venue:
PRIB'10: Proceedings of the 5th IAPR International Conference on Pattern Recognition in Bioinformatics
Year:
2010

Abstract

Discovering patterns in sequence data has a significant impact in genomics, proteomics, and business. A common problem is that the discovered patterns contain many redundancies: patterns that appear statistically significant only because they are induced by strongly significant subpatterns. The concept of statistically induced patterns is proposed to capture these redundancies. An algorithm is then developed to efficiently discover non-induced significant patterns from a large sequence dataset. For performance evaluation, two experiments were conducted to demonstrate a) the seriousness of the problem, using synthetic data, and b) that the top non-induced significant patterns discovered from Saccharomyces cerevisiae (yeast) correspond to transcription factor binding sites found by biologists. The experiments confirm the effectiveness of our method in generating a relatively small set of patterns that reveal interesting, previously unknown information inherent in the sequences.
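
The idea of an induced pattern can be illustrated with a small sketch. The abstract does not give the paper's statistical test, so everything below is an assumption made for illustration only: the i.i.d. character null model, the z-score (standardized residual) test, the threshold of 3.0, and the function names `count_kmers`, `char_probs`, `z_score`, and `classify` are all hypothetical. A pattern is first tested against the null model; if it looks significant, it is re-tested against the count expected from each of its subpatterns that is one character shorter. If a subpattern already accounts for the count, the pattern is labeled induced.

```python
from collections import Counter
from math import sqrt


def count_kmers(seqs, k):
    """Count every contiguous substring of length k across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def char_probs(seqs):
    """Estimate single-character probabilities from the data."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def z_score(observed, expected):
    """Standardized residual under a Poisson-like approximation (assumption)."""
    if expected <= 0:
        return float("inf") if observed > 0 else 0.0
    return (observed - expected) / sqrt(expected)


def classify(seqs, pattern, threshold=3.0):
    """Label a pattern 'not significant', 'induced', or 'non-induced'.

    Illustrative only; this is not the test defined in the paper.
    """
    k = len(pattern)
    if k < 2:
        raise ValueError("pattern must have at least two characters")
    probs = char_probs(seqs)
    observed = count_kmers(seqs, k)[pattern]

    # Step 1: significance against the i.i.d. character null model.
    positions = sum(max(len(s) - k + 1, 0) for s in seqs)
    p_null = 1.0
    for c in pattern:
        p_null *= probs.get(c, 0.0)
    if z_score(observed, positions * p_null) < threshold:
        return "not significant"

    # Step 2: re-test against each subpattern of length k - 1.
    # If the subpattern's count times the probability of the extra
    # character already explains the observed count, the pattern's
    # significance is treated as induced by that subpattern.
    sub_counts = count_kmers(seqs, k - 1)
    for sub, extra in ((pattern[:-1], pattern[-1]), (pattern[1:], pattern[0])):
        expected = sub_counts[sub] * probs.get(extra, 0.0)
        if z_score(observed, expected) < threshold:
            return "induced"
    return "non-induced"


# Toy data: a strong core "XY" followed by an ordinary background character.
seqs = ["XYABXYBAABABABAB" * 10]
for pattern in ("XY", "XYA", "AAB"):
    print(pattern, classify(seqs, pattern))
```

In this toy dataset the core "XY" is reported as non-induced, while "XYA" looks significant under the null model only because it contains "XY", so it is reported as induced. Filtering out such patterns is what keeps the reported set small.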