In this paper we investigate the general problem of discovering recurrent patterns embedded in categorical sequences. An important real-world instance of this problem is motif discovery in DNA sequences. We investigate the fundamental aspects of this data mining problem that can make discovery "easy" or "hard." We present a general framework for characterizing learning in this context by deriving the Bayes error rate for the problem under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others, and it explains the role of parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms: it gives a lower bound on the error rate that any algorithm can achieve, and hence a ceiling on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results that complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.
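As a rough illustration of the idea of a Bayes error rate, the sketch below computes it for a deliberately simplified setting that is not the paper's model: a single observed symbol is classified as coming from either a pattern position or the background, with no Markov context. The function name `bayes_error`, the four-letter alphabet, and all probabilities are assumptions chosen for illustration only.

```python
def bayes_error(prior_pattern, p_pattern, p_background):
    """Bayes error for a two-class problem over a discrete alphabet.

    For each symbol the optimal rule picks the class with the larger
    joint probability, so the error mass is the smaller of the two.
    """
    symbols = set(p_pattern) | set(p_background)
    return sum(
        min(prior_pattern * p_pattern.get(s, 0.0),
            (1.0 - prior_pattern) * p_background.get(s, 0.0))
        for s in symbols
    )


# Hypothetical emission distributions (illustrative values only).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pattern = {"A": 0.9, "C": 0.1 / 3, "G": 0.1 / 3, "T": 0.1 / 3}

# Equally likely classes: the error is roughly 0.175.
err_frequent = bayes_error(0.5, pattern, background)

# Rare pattern (prior 0.1): the error is roughly 0.1, i.e. equal to the prior.
err_rare = bayes_error(0.1, pattern, background)

print(err_frequent, err_rare)
```

In the rare-pattern case the Bayes error equals the pattern prior, which means the optimal rule labels every symbol as background. Even in this toy setting, a low pattern frequency is enough to make a noisy pattern position undetectable from a single symbol.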