Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Data summarization has been recognized as a fundamental operation in database systems and data mining, with important applications such as data compression and privacy preservation. While existing methods such as CF-values and Data Bubbles may perform reasonably well, they cannot provide any guarantees on the quality of their results. In this paper, we introduce a summarization approach for numerical data based on discs, which formalizes the notion of quality. Our objective is to find a minimal set of discs, i.e., spheres satisfying a radius constraint and a significance constraint, that covers the given dataset. Since the proposed problem is NP-complete, we design two different approximation algorithms. These algorithms have a quality guarantee, but they do not scale well to large databases. However, the machinery from approximation algorithms allows a precise characterization of a further, heuristic algorithm. This efficient heuristic algorithm exploits multi-dimensional index structures and can be well integrated with database systems. The experiments show that our heuristic algorithm generates summaries that outperform the state-of-the-art Data Bubbles in terms of internal measures, as well as in terms of external measures when the data summaries are used as input for clustering methods.
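To make the covering objective concrete, the following is a minimal sketch of a generic greedy set-cover heuristic applied to discs. It is not the paper's approximation or heuristic algorithm, which the abstract does not specify. It rests on several assumptions: candidate discs are centered only at data points, the radius constraint is a fixed maximum radius `radius`, and the significance constraint is read as "a disc must contain at least `min_points` points". The function name `greedy_disc_cover` and both parameter names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def greedy_disc_cover(points, radius, min_points):
    """Greedy sketch: cover points with few discs of a fixed radius.

    Assumptions (not taken from the paper): disc centers are data points,
    and a disc is "significant" if it contains at least `min_points` points.
    Returns (chosen_centers, outliers).
    """
    # One candidate disc per data point: the indices of all points within `radius`.
    candidates = []
    for center in points:
        members = {i for i, p in enumerate(points) if dist(center, p) <= radius}
        if len(members) >= min_points:  # keep only significant discs
            candidates.append((center, members))

    # Points that lie in at least one significant disc; the rest are outliers.
    coverable = set().union(*(m for _, m in candidates))
    uncovered = set(coverable)

    chosen = []
    while uncovered:
        # Pick the disc that covers the most still-uncovered points.
        center, members = max(candidates, key=lambda cm: len(cm[1] & uncovered))
        chosen.append(center)
        uncovered -= members

    outliers = [p for i, p in enumerate(points) if i not in coverable]
    return chosen, outliers
```

This brute-force version compares every pair of points, so it illustrates the scalability concern raised in the abstract; replacing the neighbor search with a multi-dimensional index would be the natural next step, though how the paper does so is not described here.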