We aim to find sequential patterns in a long, continuous, noisy DNA sequence. Sequential pattern mining, which discovers frequent subsequences as patterns in a sequence database, is an important data mining problem with broad applications, including the analysis of customer purchase patterns, Web access patterns, and DNA sequences. We investigate sequential pattern mining algorithms for long continuous DNA sequences. Most previously proposed mining algorithms rely on exact matching against a sequential pattern definition, so in practice they cannot cope with noisy environments or inaccurate data. We investigate an approximate matching method to handle those cases. In this paper, we develop and apply a pattern-growth PrefixSpan algorithm to find the most frequently repeated patterns, for example motifs, in a DNA sequence. Our algorithm gains its efficiency from pattern growth and approximation. It is based on the observation that all occurrences of a frequent pattern can be classified into groups, each of which we call an approximated pattern. We developed algorithms that quickly find all relatively frequent patterns by a pattern-growth method and then determine approximated patterns from them. Our experimental studies demonstrate that the algorithm is efficient at mining repeated approximate sequential patterns that existing methods would have missed.
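The two stages described above, growing frequent patterns and then grouping their near-identical variants, can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes contiguous patterns in a single sequence, counts overlapping occurrences, uses Hamming distance as the approximation measure, and groups greedily around the most frequent seed. The names `frequent_substrings`, `group_approximate`, `min_support`, `max_len` and `max_mismatch` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def frequent_substrings(seq, min_support, max_len):
    """Grow contiguous patterns one symbol at a time (pattern-growth style).

    Each pattern carries the list of its start positions (its projection),
    so extending it only inspects the symbol following each occurrence.
    Assumption: patterns are contiguous and overlapping occurrences count.
    """
    result = {}
    # The empty pattern occurs at every position of the sequence.
    stack = [("", list(range(len(seq))))]
    while stack:
        pattern, starts = stack.pop()
        # Bucket the current occurrences by the symbol that follows them.
        buckets = defaultdict(list)
        for s in starts:
            nxt = s + len(pattern)
            if nxt < len(seq):
                buckets[seq[nxt]].append(s)
        for symbol, new_starts in buckets.items():
            # Only patterns that stay frequent are kept and grown further.
            if len(new_starts) >= min_support:
                grown = pattern + symbol
                result[grown] = new_starts
                if len(grown) < max_len:
                    stack.append((grown, new_starts))
    return result


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def group_approximate(patterns, length, max_mismatch):
    """Greedily group equal-length patterns within max_mismatch of a seed.

    Seeds are taken in order of decreasing support (ties broken
    alphabetically); this greedy rule is an illustrative assumption.
    """
    candidates = sorted(
        (p for p in patterns if len(p) == length),
        key=lambda p: (-len(patterns[p]), p),
    )
    groups = []  # list of (seed, members) pairs
    for p in candidates:
        for seed, members in groups:
            if hamming(seed, p) <= max_mismatch:
                members.append(p)
                break
        else:
            groups.append((p, [p]))
    return groups


seq = "ACGTACGAACGT"
exact = frequent_substrings(seq, min_support=2, max_len=4)
print(exact["ACGT"])  # [0, 8]

# With min_support=1 every substring is kept, so the noisy variant "ACGA"
# can be attached to the group seeded by the exact repeat "ACGT".
everything = frequent_substrings(seq, min_support=1, max_len=4)
print(group_approximate(everything, length=4, max_mismatch=1)[0])
```

In this toy sequence exact matching with a support of 2 reports only `ACGT` among the length-4 patterns, while the grouping step also attaches the single-mismatch occurrence `ACGA` to it. That is the kind of noisy repeat the abstract says exact-matching methods miss.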