A framework for mining evolving trends in web data streams using dynamic learning and retrospective validation

Authors:
Olfa Nasraoui; Carlos Rojas; Cesar Cardona
Affiliations:
Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY; Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY; Magnify Inc., Chicago
Venue:
Computer Networks: The International Journal of Computer and Telecommunications Networking - Web dynamics
Year:
2006


Abstract

The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data continuously, without unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic, single-pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the "you only get to see it once" constraint on stream data call for different computational models, which may furthermore bring some interesting surprises in the behavior of some well-known similarity measures during clustering, and even validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns under the harsh single-pass requirement. We propose a simple similarity measure, MinPC, that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity and its close relatives, such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments.
Our experiments confirm that the MinPC similarity generally performs better than the cosine similarity, and that this advantage can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unfolds. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, under different trend sequencing scenarios.
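
The contrast between the cosine, Jaccard, and MinPC measures can be illustrated with a minimal sketch. The abstract does not define MinPC; the code below assumes, based on the name and on the stated coupling of precision and coverage, that it is the minimum of the two. It further assumes that sessions and profiles are binary item sets, that precision is the fraction of profile items found in the session, and that coverage is the fraction of session items found in the profile. All function names are hypothetical and are not taken from the paper.

```python
import math

# Illustrative sketch only: profiles and sessions are modeled as sets of items
# (e.g. URLs or terms). The definitions of precision, coverage, and MinPC are
# assumptions, not formulas quoted from the paper.

def cosine(profile: set, session: set) -> float:
    """Cosine similarity for binary vectors represented as sets."""
    if not profile or not session:
        return 0.0
    return len(profile & session) / math.sqrt(len(profile) * len(session))

def jaccard(profile: set, session: set) -> float:
    """Jaccard coefficient: shared items over all distinct items."""
    union = profile | session
    return len(profile & session) / len(union) if union else 0.0

def precision(profile: set, session: set) -> float:
    """Assumed definition: fraction of profile items present in the session."""
    return len(profile & session) / len(profile) if profile else 0.0

def coverage(profile: set, session: set) -> float:
    """Assumed definition: fraction of session items present in the profile."""
    return len(profile & session) / len(session) if session else 0.0

def min_pc(profile: set, session: set) -> float:
    """Assumed MinPC: high only when precision AND coverage are both high."""
    return min(precision(profile, session), coverage(profile, session))

# A broad profile matched against a short session that it fully contains.
profile = {"a", "b", "c", "d", "e", "f", "g", "h"}
session = {"a", "b"}

scores = {
    "cosine": cosine(profile, session),
    "jaccard": jaccard(profile, session),
    "precision": precision(profile, session),
    "coverage": coverage(profile, session),
    "min_pc": min_pc(profile, session),
}
print(scores)
```

Under these assumptions, the cosine similarity of two binary sets equals the geometric mean of precision and coverage, so a perfect coverage can partly mask a poor precision. Taking the minimum instead penalizes a profile whenever either criterion is low, which is one way to read the abstract's point about seeking both simultaneously.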