The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data continuously, without unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic, single-pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the "you only get to see it once" constraint on stream data call for different computational models, which may in turn bring some interesting surprises in the behavior of well-known similarity measures during clustering, and even validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns under the harsh single-pass requirement. We propose a simple similarity measure that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity and its close relatives, such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments.
Our experiments confirm that the proposed MinPC similarity generally performs better than the cosine similarity, and that this advantage can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unfolds. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, under different trend sequencing scenarios.
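
To make the contrast between the similarity measures concrete, the following is a minimal sketch and not the paper's implementation. It assumes that sessions and profiles are plain sets of items (binary data), that precision is the fraction of profile items found in the session, that coverage is the fraction of session items found in the profile, and that MinPC is the minimum of the two. The abstract does not spell out these definitions, so all function names and formulas here are illustrative assumptions. Under these assumptions the binary cosine equals the geometric mean of precision and coverage, so a profile that is high on one criterion and low on the other can still score moderately, whereas the minimum is bounded by the weaker criterion.

```python
import math


def precision(session: set, profile: set) -> float:
    """Assumed definition: fraction of profile items present in the session."""
    return len(session & profile) / len(profile) if profile else 0.0


def coverage(session: set, profile: set) -> float:
    """Assumed definition: fraction of session items present in the profile."""
    return len(session & profile) / len(session) if session else 0.0


def cosine_binary(session: set, profile: set) -> float:
    """Cosine similarity for binary vectors, written in set form."""
    if not session or not profile:
        return 0.0
    return len(session & profile) / math.sqrt(len(session) * len(profile))


def jaccard(session: set, profile: set) -> float:
    """Jaccard coefficient: intersection size over union size."""
    union = session | profile
    return len(session & profile) / len(union) if union else 0.0


def min_pc(session: set, profile: set) -> float:
    """Assumed form of MinPC: the smaller of precision and coverage."""
    return min(precision(session, profile), coverage(session, profile))


# Hypothetical example: a short session fully contained in a broad profile.
session = {"a", "b"}
profile = {"a", "b", "c", "d", "e", "f", "g", "h"}

for name, fn in [("precision", precision), ("coverage", coverage),
                 ("cosine", cosine_binary), ("jaccard", jaccard),
                 ("min_pc", min_pc)]:
    print(f"{name:>9}: {fn(session, profile):.3f}")
```

In this hypothetical example the coverage is perfect but the precision is low; the cosine reports the geometric mean of the two, while the minimum-based score stays at the level of the weaker criterion.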