Concept Clustering of Evolving Data

Authors:
Shixi Chen;Haixun Wang;Shuigeng Zhou
Affiliations:
-;-;-
Venue:
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Year:
2009

Citing 0
Cited 2

An algorithmic approach to event summarization

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Finding semantics in time series

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data


Abstract

In Web search, a user often refines a search several times before finding the needed information. It is very likely that similar sequences of searches appear many times in the search log, as many users have searched the Web with the same intent. Precisely interpreting the intent of the user is difficult, even with the help of the search log: there might be numerous instances of such intent scattered in small pieces throughout the log, but none of them is comprehensive enough to describe the concept precisely. This scenario occurs in many applications. For example, patterns in Web search, Internet traffic, program execution traces, network events, etc., are often non-stationary, yet the same patterns recur over time. In this paper, we argue that visible patterns are generated by hidden intent or hidden concepts, and that precisely characterizing such concepts is only possible if we cluster as much of the data generated by each concept as possible and learn from the clustered data as a whole, instead of learning from a single episode of the concept. The benefit is clear: it enables us not only to better understand the underlying system that generates the data, but also to recognize future instances of a concept as soon as they occur. To achieve this, we introduce a clustering-based approach, in which we adopt a novel clustering criterion, validation error minimization, to ensure that the concepts found are unique and precise. We propose a two-step algorithm, which uses enhanced dynamic programming and EM-like methods for clustering. Experiments show that, on benchmark datasets, our approach achieves the highest accuracy at the lowest cost in comparison with the current best approaches.
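
The abstract names validation error minimization as the clustering criterion but does not spell out the procedure. The sketch below is only an illustration of the general idea, not the authors' algorithm: it assumes the stream has already been cut into labeled batches, uses a toy majority-vote model, and greedily merges the two batches whose models validate best on each other. The names `train`, `cross_error`, `cluster_batches`, and the `max_error` threshold are all hypothetical, and the dynamic programming and EM-like steps mentioned in the abstract are not reproduced.

```python
from itertools import combinations


def train(points):
    """Toy model (an assumption): map each feature value to its majority label."""
    counts = {}
    for x, y in points:
        by_label = counts.setdefault(x, {})
        by_label[y] = by_label.get(y, 0) + 1
    return {x: max(ys, key=ys.get) for x, ys in counts.items()}


def error(model, points):
    """Fraction of points mislabeled; unseen feature values count as errors."""
    wrong = sum(1 for x, y in points if model.get(x) != y)
    return wrong / len(points)


def cross_error(a, b):
    """Symmetric validation error: train on one batch, validate on the other."""
    return 0.5 * (error(train(a), b) + error(train(b), a))


def cluster_batches(batches, max_error=0.2):
    """Greedily merge the pair of clusters with the lowest cross-validation
    error until no pair falls at or below max_error (hypothetical threshold)."""
    clusters = [list(b) for b in batches]
    while len(clusters) > 1:
        (i, j), err = min(
            (((i, j), cross_error(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair_err: pair_err[1],
        )
        if err > max_error:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Three batches: the first and third follow the same concept, the second flips it.
batch_1 = [("a", 1), ("b", 0), ("a", 1), ("b", 0)]
batch_2 = [("a", 0), ("b", 1), ("a", 0), ("b", 1)]
batch_3 = [("a", 1), ("b", 0), ("b", 0), ("a", 1)]
concepts = cluster_batches([batch_1, batch_2, batch_3])
```

In this toy run the first and third batches end up in one cluster, because a model trained on either predicts the other perfectly, while the second batch stays separate. Pooling batches this way gives each concept more training data than any single episode provides, which is the motivation the abstract describes.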