FS-Miner: efficient and incremental mining of frequent sequence patterns in web logs

  • Authors: Maged El-Sayed; Carolina Ruiz; Elke A. Rundensteiner

  • Affiliations: Worcester Polytechnic Institute (all authors)

  • Venue: Proceedings of the 6th annual ACM international workshop on Web information and data management

  • Year: 2004

Abstract

Mining frequent patterns is an important component of many prediction systems. One common usage in web applications is the mining of users' access behavior for the purpose of predicting, and hence pre-fetching, the web pages that the user is likely to visit. In this paper we introduce an efficient strategy for discovering frequent patterns in sequence databases that requires only two scans of the database. The first scan obtains support counts for subsequences of length two. The second scan extracts potentially frequent sequences of any length and represents them in a compressed frequent sequences tree structure (FS-tree). Frequent sequence patterns are then mined from the FS-tree. The FS-tree also facilitates incremental and interactive mining. As part of this work, we developed the FS-Miner, a system that discovers frequent sequences from web log files. The FS-Miner can adapt to changes in users' behavior over time, in the form of new input sequences, and respond incrementally without performing full re-computation. Our system also allows the user to change the input parameters (e.g., minimum support and desired pattern size) interactively without requiring full re-computation in most cases. We tested our system against two other algorithms from the literature. Our experimental results show that our system scales up linearly with the size of the input database. Furthermore, it exhibits excellent adaptability to support threshold decreases. We also show that the incremental update capability of the system provides significant performance advantages over full re-computation even for relatively large update sizes.
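
The two-scan idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' FS-Miner implementation: the abstract does not specify the FS-tree layout, so the sketch replaces the tree with a plain counter over contiguous fragments. The function names (`count_links`, `split_on_infrequent_links`, `mine`), the use of an absolute `min_count` threshold, and the choice to count every occurrence rather than once per session are all assumptions made for illustration.

```python
from collections import Counter


def count_links(sequences):
    """First scan: count every contiguous pair (length-two subsequence)."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts


def split_on_infrequent_links(seq, link_counts, min_count):
    """Cut a sequence wherever a pair falls below min_count.

    Returns the surviving fragments of length >= 2; only these can
    contain frequent contiguous patterns.
    """
    fragments = []
    current = [seq[0]] if seq else []
    for a, b in zip(seq, seq[1:]):
        if link_counts[(a, b)] >= min_count:
            current.append(b)
        else:
            if len(current) > 1:
                fragments.append(tuple(current))
            current = [b]
    if len(current) > 1:
        fragments.append(tuple(current))
    return fragments


def mine(sequences, min_count):
    """Two scans: count pairs, then count patterns inside surviving fragments.

    Assumption: a flat Counter stands in for the paper's FS-tree.
    """
    link_counts = count_links(sequences)          # scan 1
    support = Counter()
    for seq in sequences:                         # scan 2
        for frag in split_on_infrequent_links(seq, link_counts, min_count):
            # Count every contiguous subsequence of length >= 2.
            for i in range(len(frag)):
                for j in range(i + 2, len(frag) + 1):
                    support[frag[i:j]] += 1
    return {p: c for p, c in support.items() if c >= min_count}
```

For example, calling `mine` on three hypothetical sessions `["a", "b", "c"]`, `["a", "b", "d"]`, and `["a", "b", "c"]` with `min_count=2` drops the pair `("b", "d")` after the first scan, so the second session contributes only the fragment `("a", "b")`. A tree-based representation, as in the paper, would additionally share common prefixes between fragments instead of enumerating each subsequence separately.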