PMBC: Pattern mining from biological sequences with wildcard constraints

Authors:
Xindong Wu;Xingquan Zhu;Yu He;Abdullah N. Arslan
Affiliations:
Department of Computer Science, University of Vermont, Burlington, VT 05401, USA;Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA;Department of Computer Science, University of Vermont, Burlington, VT 05401, USA;Department of Computer Science, University of Vermont, Burlington, VT 05401, USA
Venue:
Computers in Biology and Medicine
Year:
2013

Abstract

Patterns (subsequences) that appear frequently in sequences provide essential knowledge for domain experts, such as molecular biologists, who seek rules or regularities hidden in the data. Because biological data are inherently complex, patterns rarely repeat exactly; instead, they appear in a slightly different form at each occurrence. A gap constraint, also referred to in this paper as a wildcard, is a character that can be substituted for any character predefined in an alphabet. It gives users the flexibility to capture useful patterns even when their appearances vary across the sequences. To find patterns, existing tools require users to specify gap constraints explicitly beforehand. In practice, choosing proper gap constraint values is often nontrivial or time-consuming. In addition, a change to the gap values may give completely different results and requires a separate, time-consuming re-mining procedure. It is therefore desirable to find patterns automatically and efficiently without user-specified gap requirements. In this paper, we study the problem of frequent pattern mining without user-specified gap constraints and propose PMBC (Pattern Mining from Biological sequences with wildcard Constraints) to solve it. Given a sequence and a support threshold (i.e., a pattern frequency threshold), PMBC aims to discover all subsequences whose support is equal to or greater than the threshold. These frequent subsequences are then used to form patterns. Two heuristic methods, based on one-way and two-way scans respectively, are proposed to discover frequent subsequences and estimate their frequency in the sequences. Experimental results on both synthetic and real-world DNA sequences demonstrate the performance of both methods for frequent pattern mining and pattern frequency estimation.
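
To make the role of gap values concrete, the following is a minimal Python sketch of the conventional setting the abstract contrasts against, in which the user fixes the gap bounds in advance. It is not the PMBC algorithm and does not reproduce its one-way or two-way scan heuristics. The function names `count_occurrences` and `frequent_patterns`, the occurrence definition (each distinct tuple of matching positions counts once), and the use of a raw count as the support threshold are all assumptions made for illustration.

```python
from itertools import product
from typing import Dict


def count_occurrences(sequence: str, pattern: str, min_gap: int, max_gap: int) -> int:
    """Count occurrences of `pattern` in `sequence` where between min_gap and
    max_gap wildcard characters separate each pair of consecutive pattern
    characters. Each distinct tuple of matching positions counts once
    (an assumed definition, chosen for illustration)."""
    if not pattern:
        return 0
    # ways[i] = number of partial matches of the current prefix ending at position i
    ways = [1 if ch == pattern[0] else 0 for ch in sequence]
    for p in pattern[1:]:
        nxt = [0] * len(sequence)
        for i, ch in enumerate(sequence):
            if ch != p:
                continue
            lo = max(i - max_gap - 1, 0)  # farthest allowed previous position
            hi = i - min_gap - 1          # nearest allowed previous position
            for j in range(lo, hi + 1):
                nxt[i] += ways[j]
        ways = nxt
    return sum(ways)


def frequent_patterns(sequence: str, max_len: int, min_gap: int,
                      max_gap: int, threshold: int) -> Dict[str, int]:
    """Brute-force enumeration of all patterns up to max_len whose occurrence
    count reaches `threshold` (a raw count here, which is an assumption)."""
    alphabet = sorted(set(sequence))
    result = {}
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            pattern = "".join(chars)
            count = count_occurrences(sequence, pattern, min_gap, max_gap)
            if count >= threshold:
                result[pattern] = count
    return result


if __name__ == "__main__":
    dna = "ACGTACGT"
    print(count_occurrences(dna, "AG", 1, 1))  # exactly one wildcard between A and G
    print(count_occurrences(dna, "AG", 0, 5))  # zero to five wildcards allowed
    print(frequent_patterns(dna, 2, 0, 0, 2))
```

On the toy sequence `ACGTACGT`, the pattern `AG` occurs 2 times with exactly one wildcard between the two characters but 3 times when zero to five wildcards are allowed, so the same pattern can pass or fail a given support threshold depending only on the gap values chosen. The brute-force enumeration is exponential in pattern length and is meant only to show the problem statement, not an efficient mining strategy.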