Pattern discovery in sequences under a Markov assumption

Authors:
Darya Chudova; Padhraic Smyth
Affiliations:
University of California, Irvine, CA; University of California, Irvine, CA
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. We investigate the fundamental aspects of this data mining problem that can make discovery "easy" or "hard." We present a general framework for characterizing learning in this context by deriving the Bayes error rate for this problem under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.
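To make the notion of a Bayes error rate concrete, the following is a minimal sketch of the textbook definition applied to a single sequence position. It is not the derivation from the paper: it ignores the Markov dependence between positions and only asks whether one observed symbol came from a pattern position or from the background. The function names, the uniform background, and the "noisy consensus" emission model are all assumptions introduced for illustration.

```python
def consensus_emission(alphabet_size, noise):
    """Hypothetical emission model for one pattern position.

    The consensus symbol (index 0) is emitted with probability 1 - noise;
    the remaining probability mass is spread evenly over the other symbols.
    """
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    if not 0.0 <= noise <= 1.0:
        raise ValueError("noise must lie in [0, 1]")
    other = noise / (alphabet_size - 1)
    return [1.0 - noise] + [other] * (alphabet_size - 1)


def bayes_error(prior_pattern, p_pattern, p_background):
    """Bayes error for deciding 'pattern' vs 'background' from one symbol.

    For each symbol the optimal classifier picks the class with the larger
    joint probability, so the error contributed by that symbol is the
    smaller of the two joint probabilities.
    """
    if len(p_pattern) != len(p_background):
        raise ValueError("distributions must cover the same alphabet")
    return sum(
        min(prior_pattern * pp, (1.0 - prior_pattern) * pb)
        for pp, pb in zip(p_pattern, p_background)
    )


if __name__ == "__main__":
    background = [0.25] * 4  # assumed uniform background over a DNA-sized alphabet
    for prior in (0.5, 0.1, 0.01):
        for noise in (0.0, 0.1, 0.3):
            err = bayes_error(prior, consensus_emission(4, noise), background)
            print(f"prior={prior:<5} noise={noise:<4} bayes_error={err:.4f}")
```

Even in this simplified setting, the error depends on both the prior probability of a pattern position (a stand-in for pattern frequency) and the emission noise. When pattern positions are rare, the optimal decision is to label everything as background, and the error equals the prior.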