MAIL: mining sequential patterns with wildcards

Authors:
Fei Xie;Xindong Wu;Xuegang Hu;Jun Gao;Dan Guo;Yulian Fei;Ertian Hua
Affiliations:
College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China/ Department of Computer Science and Technology, Hefei Normal University, Hefei 230601, ...;College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China/ Department of Computer Science, University of Vermont, Burlington, VT 05405, USA;College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China;College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China;College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China;College of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, China;College of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, China
Venue:
International Journal of Data Mining and Bioinformatics
Year:
2013

Citing 15

SPADE: an efficient algorithm for mining frequent sequences
Machine Learning

Mining Sequential Patterns: Generalizations and Performance Improvements
EDBT '96 Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology

Mining Sequential Patterns
ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering

PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth
Proceedings of the 17th International Conference on Data Engineering

Sequential PAttern mining using a bitmap representation
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Mining periodic patterns with gap requirement from sequences
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining

An efficient motif discovery algorithm with unknown motif length and number of binding sites
International Journal of Data Mining and Bioinformatics

Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering

Mining complex patterns across sequences with gap requirements
IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence

Mining Frequent Patterns with Gaps and One-Off Condition
CSE '09 Proceedings of the 2009 International Conference on Computational Science and Engineering - Volume 01

Synthetic gene design with a large number of hidden stops
International Journal of Data Mining and Bioinformatics

Mining top-k frequent patterns without minimum support threshold
Knowledge and Information Systems

A unified view of the apriori-based algorithms for frequent episode discovery
Knowledge and Information Systems

Text document clustering using global term context vectors
Knowledge and Information Systems

Cited 1

Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph
Applied Intelligence

Abstract

Sequential pattern mining is an important research task in many domains, such as biological science. In this paper, we study the problem of mining frequent patterns with wildcards from sequences, where the user can flexibly specify the gap constraints. Given a subject sequence, a minimum support threshold and a gap constraint, we aim to find the frequent patterns whose supports in the sequence are no less than the given threshold. We design an efficient mining algorithm, MAIL. Two pattern-growth strategies are proposed to improve completeness and time efficiency: one is based on candidate occurrence pruning, and the other uses an occurrence graph. A random data generator is designed to test completeness on artificial data. Experiments on DNA sequences show that MAIL mines four times more patterns than one of its peers and runs, on average, six times faster than another. We also give a concrete example in which our algorithm is applied to DNA sequences to find interesting patterns.
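
To make the problem statement concrete, the following is a minimal sketch of what counting a pattern's occurrences under a gap constraint could look like. It is not the MAIL algorithm: the abstract does not give the paper's exact support definition, its candidate occurrence pruning, or its occurrence graph. The function name `count_occurrences`, the parameters `min_gap` and `max_gap`, and the choice to count every distinct tuple of matching positions are all assumptions made for illustration.

```python
def count_occurrences(sequence, pattern, min_gap, max_gap):
    """Count position tuples where `pattern` matches `sequence` such that
    between every two consecutive pattern characters there are at least
    `min_gap` and at most `max_gap` wildcard (skipped) positions.

    Assumption: every distinct tuple of positions counts as one occurrence;
    the paper's own support definition may differ.
    """
    if not pattern:
        return 0
    # ways[i] = number of matches of the current pattern prefix ending at i
    ways = [1 if ch == pattern[0] else 0 for ch in sequence]
    for symbol in pattern[1:]:
        extended = [0] * len(sequence)
        for i, ch in enumerate(sequence):
            if ch != symbol:
                continue
            # The previous character must sit at j with
            # min_gap <= i - j - 1 <= max_gap.
            lo = max(0, i - max_gap - 1)
            hi = i - min_gap - 1
            if hi >= lo:
                extended[i] = sum(ways[lo:hi + 1])
        ways = extended
    return sum(ways)


def is_frequent(sequence, pattern, min_gap, max_gap, min_support):
    """Hypothetical helper: a pattern is frequent if its count reaches the threshold."""
    return count_occurrences(sequence, pattern, min_gap, max_gap) >= min_support
```

For example, with the sequence `AACC`, the pattern `AC` and a gap constraint of 0 to 1 wildcards, the position pairs (0, 2), (1, 2) and (1, 3) qualify, while (0, 3) is rejected because it skips two positions. A full miner would additionally have to enumerate candidate patterns, which is the part that pattern-growth strategies address.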