Mining sequential patterns by PrefixSpan algorithm with approximation

  • Authors:
  • Ankhbayar Yukhuu;Sansarbold Garamragchaa;Hwang Young Sup

  • Affiliations:
  • Department of Computer Science, Sun Moon University, Asan, Chungnam, South Korea;Department of Computer Science, Sun Moon University, Asan, Chungnam, South Korea;Department of Computer Science, Sun Moon University, Asan, Chungnam, South Korea

  • Venue:
  • ACS'08 Proceedings of the 8th conference on Applied computer science
  • Year:
  • 2008

Abstract

We aim to find sequential patterns in a long, continuous, noisy DNA sequence. Sequential pattern mining, which discovers frequent subsequences as patterns in a sequence database, is an important data mining problem with broad applications, including the analysis of customer purchase patterns, Web access patterns, and DNA sequences. We investigated sequential pattern mining algorithms for long continuous DNA sequences. Most previously proposed mining algorithms match a sequential pattern definition exactly, so in practice they cannot cope with noisy environments or inaccurate data. We investigated an approximate matching method to deal with those cases. In this paper, we develop and apply a pattern-growth PrefixSpan algorithm to find the most frequently repeated patterns, for example, motifs in a DNA sequence. Our algorithm gains its efficiency from pattern growth and approximation methodologies. It is based on the observation that all occurrences of a frequent pattern can be classified into groups, which we call approximated patterns. We developed algorithms that quickly find all relatively frequent patterns by a pattern growth method and then determine approximated patterns from those frequent patterns. Our experimental studies demonstrate that our algorithm is efficient in mining repeated approximate sequential patterns that would have been missed by existing methods.
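
To make the pattern-growth idea concrete, the following is a minimal sketch of plain, exact-matching PrefixSpan over a small database of DNA strings. It is not the authors' implementation: the function name `prefixspan`, the `min_support` parameter, and the toy input are assumptions made for illustration, and the approximation step is omitted because the abstract does not specify how occurrences are grouped. Splitting one long sequence into several shorter strings, as the toy input does, is likewise an assumption.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern: support} for every subsequence that occurs in at
    least `min_support` of the input strings (exact matching, gaps allowed).

    Hypothetical helper for illustration only.
    """
    results = {}

    def grow(prefix, projected):
        # `projected` is the projected database for `prefix`: one
        # (sequence index, start offset) pair per sequence containing it.
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seq = sequences[seq_id]
            seen = set()
            for pos in range(start, len(seq)):
                item = seq[pos]
                # Only the first occurrence after `start` matters: it
                # leaves the longest suffix for further growth.
                if item not in seen:
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, next_projected in sorted(occurrences.items()):
            # Support is the number of sequences that contain the pattern.
            if len(next_projected) >= min_support:
                pattern = prefix + item
                results[pattern] = len(next_projected)
                grow(pattern, next_projected)

    grow("", [(i, 0) for i in range(len(sequences))])
    return results


# Toy input (assumed): three short reads instead of one long sequence.
reads = ["ACGT", "ACT", "AGT"]
print(prefixspan(reads, min_support=3))  # {'A': 3, 'AT': 3, 'T': 3}
```

Each recursive call scans only the suffixes that follow the current prefix, which is where pattern growth saves work compared with generating and testing candidate patterns against the full database.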