DSM-PLW: single-pass mining of path traversal patterns over streaming web click-sequences

  • Authors:
  • Hua-Fu Li; Suh-Yin Lee; Man-Kwan Shan

  • Affiliations:
  • Department of Computer Science and Information Engineering, National Chiao-Tung University, Hsinchu, Taiwan, ROC; Department of Computer Science and Information Engineering, National Chiao-Tung University, Hsinchu, Taiwan, ROC; Department of Computer Science, National Chengchi University, Wenshan, Taipei, Taiwan, ROC

  • Venue:
  • Computer Networks: The International Journal of Computer and Telecommunications Networking - Web dynamics
  • Year:
  • 2006

Abstract

Mining Web click streams is an important data mining problem with broad applications. It is also a difficult one, because streaming data have an unknown or unbounded length, may arrive at a very fast rate, cannot be revisited once a click-sequence has passed, and arrive in an order the system does not control. In this paper, we propose a projection-based, single-pass algorithm, called DSM-PLW (Data Stream Mining for Path traversal patterns in a Landmark Window), for online incremental mining of path traversal patterns over a continuous stream of maximal forward references generated at a rapid rate. The algorithm projects each maximal forward reference of the stream into a set of reference-suffix maximal forward references and inserts them into a new in-memory summary data structure, called SP-forest (Summary Path traversal pattern forest). The SP-forest is an extended prefix-tree-based data structure that stores essential information about the frequent reference sequences of the stream so far. The set of all maximal reference sequences is then determined from the SP-forest by a depth-first-search mechanism, called MRS-mining (Maximal Reference Sequence mining). Theoretical analysis and experimental studies show that the proposed algorithm makes only one pass over the streaming data and that its memory requirements grow gently.
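
The projection and depth-first steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' SP-forest or MRS-mining: the names `SuffixForest`, `suffix_projections`, and `maximal_sequences` are hypothetical, the forest keeps exact counts with no pruning or memory bound, and the sketch assumes that a maximal forward reference never repeats a page and that a pattern is a contiguous run of pages.

```python
class Node:
    """One page in a prefix tree, with the number of references reaching it."""
    def __init__(self):
        self.count = 0
        self.children = {}


def suffix_projections(reference):
    """Project one maximal forward reference into all of its suffixes."""
    return [reference[i:] for i in range(len(reference))]


class SuffixForest:
    """Simplified stand-in for a summary forest: one prefix tree per first page."""
    def __init__(self):
        self.roots = {}

    def insert(self, reference):
        # Single pass: each arriving reference is projected and inserted once.
        for suffix in suffix_projections(tuple(reference)):
            level = self.roots
            for page in suffix:
                node = level.setdefault(page, Node())
                node.count += 1
                level = node.children

    def frequent_sequences(self, min_count):
        """Depth-first search that collects every path meeting min_count."""
        found = {}

        def walk(level, prefix):
            for page, node in level.items():
                if node.count >= min_count:
                    sequence = prefix + (page,)
                    found[sequence] = node.count
                    walk(node.children, sequence)

        walk(self.roots, ())
        return found


def is_contiguous_subsequence(short, long):
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_sequences(frequent):
    """Keep frequent sequences not contained in a longer frequent sequence."""
    return sorted(
        s for s in frequent
        if not any(s != t and is_contiguous_subsequence(s, t) for t in frequent)
    )


# Toy stream of maximal forward references (made-up page names).
stream = [("A", "B", "C", "D"), ("A", "B", "C"), ("B", "C", "D"), ("A", "C")]
forest = SuffixForest()
for reference in stream:
    forest.insert(reference)

frequent = forest.frequent_sequences(min_count=2)
maximal = maximal_sequences(frequent)
print(maximal)
```

Because every suffix is inserted, each contiguous run of pages in a reference appears as a root-to-node path, so a node's count equals the number of references containing that run under the no-repeated-page assumption. On the toy stream with `min_count=2`, the maximal sequences are `A B C` and `B C D`.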