Scalable sequential pattern mining for biological sequences

  • Authors: Ke Wang; Yabo Xu; Jeffrey Xu Yu

  • Affiliations: Simon Fraser University; Simon Fraser University and Chinese University of Hong Kong; Chinese University of Hong Kong

  • Venue: Proceedings of the thirteenth ACM international conference on Information and knowledge management
  • Year: 2004

Abstract

Biosequences typically have a small alphabet, a long length, and patterns containing gaps (i.e., "don't care") of arbitrary size. Mining frequent patterns in such sequences faces a different type of explosion than mining transaction sequences, the setting primarily motivated by market-basket analysis. In this paper, we study how this explosion affects classic sequential pattern mining, and present a scalable two-phase algorithm to deal with it. The Segment Phase first searches for short patterns containing no gaps, called segments; this phase is efficient. The Pattern Phase then searches for long patterns containing multiple segments separated by variable-length gaps; this phase is time-consuming. The purpose of the two phases is to exploit the information obtained in the first phase to speed up pattern growth and matching and to prune the search space in the second phase. We evaluate this approach on synthetic and real-life data sets.
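
The two-phase idea can be illustrated with a minimal Python sketch. This is not the authors' algorithm, only a simplified reading of the abstract under stated assumptions: segments have a single fixed length, the support of a pattern is the number of sequences containing it, consecutive segments are separated by a gap of at least `min_gap` positions with no upper bound, and the only pruning is by minimum support. The names `segment_phase`, `pattern_phase`, `seg_len`, `min_sup`, `max_segments`, and `min_gap` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def segment_phase(sequences, seg_len, min_sup):
    """Find gap-free substrings of length seg_len that occur in at least
    min_sup sequences. Returns {segment: {seq_id: sorted start positions}}."""
    occurrences = defaultdict(dict)
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - seg_len + 1):
            # Positions are appended in increasing order, so each list is sorted.
            occurrences[seq[i:i + seg_len]].setdefault(sid, []).append(i)
    return {seg: by_seq for seg, by_seq in occurrences.items()
            if len(by_seq) >= min_sup}


def pattern_phase(segments, seg_len, min_sup, max_segments, min_gap=1):
    """Grow multi-segment patterns by appending frequent segments after a gap.
    Reuses the positions recorded in the segment phase instead of rescanning
    the sequences. Returns {tuple of segments: support}."""
    # For each pattern, keep the earliest end index of a match per sequence.
    # With unbounded gaps, the earliest match leaves the most room to extend.
    frontier = {(seg,): {sid: starts[0] + seg_len for sid, starts in by_seq.items()}
                for seg, by_seq in segments.items()}
    results = {}
    for _ in range(max_segments - 1):
        grown = {}
        for pattern, ends in frontier.items():
            for seg, by_seq in segments.items():
                new_ends = {}
                for sid, end in ends.items():
                    starts = by_seq.get(sid)
                    if starts is None:
                        continue
                    # First occurrence of seg that starts after the required gap.
                    k = bisect_left(starts, end + min_gap)
                    if k < len(starts):
                        new_ends[sid] = starts[k] + seg_len
                if len(new_ends) >= min_sup:  # prune infrequent extensions
                    grown[pattern + (seg,)] = new_ends
        if not grown:
            break
        results.update({p: len(e) for p, e in grown.items()})
        frontier = grown
    return results


if __name__ == "__main__":
    sequences = ["ACGTTGCA", "ACGAGCA", "TTACGCGCA"]
    segments = segment_phase(sequences, seg_len=3, min_sup=3)
    print(sorted(segments))
    print(pattern_phase(segments, seg_len=3, min_sup=3, max_segments=3))
```

In this toy input, the segments `ACG` and `GCA` each occur in all three sequences, and the only frequent two-segment pattern is `ACG`, a gap, then `GCA`. The sketch shows why the split can help: the second phase works only on the occurrence lists produced by the first phase, so candidate patterns are built from segments already known to be frequent.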