A Disc-based Approach to Data Summarization and Privacy Preservation

  • Authors: Rong Ge; Martin Ester; Wen Jin; Zengjian Hu

  • Affiliations: Simon Fraser University (all authors)

  • Venue: SSDBM '06: Proceedings of the 18th International Conference on Scientific and Statistical Database Management
  • Year: 2006

Abstract

Data summarization has been recognized as a fundamental operation in database systems and data mining, with important applications such as data compression and privacy preservation. While existing methods such as CF-values and Data Bubbles may perform reasonably well, they cannot provide any guarantees on the quality of their results. In this paper, we introduce a summarization approach for numerical data based on discs, formalizing the notion of quality. Our objective is to find a minimal set of discs, i.e., spheres satisfying a radius constraint and a significance constraint, that covers the given dataset. Since the proposed problem is NP-complete, we design two different approximation algorithms. These algorithms have a quality guarantee, but they do not scale well to large databases. However, the machinery from approximation algorithms allows a precise characterization of a further, heuristic algorithm. This efficient heuristic algorithm exploits multi-dimensional index structures and can be well integrated with database systems. The experiments show that our heuristic algorithm generates summaries that outperform the state-of-the-art Data Bubbles in terms of internal measures, as well as in terms of external measures when the data summaries are used as input for clustering methods.
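To make the covering objective concrete, the following is a minimal sketch of a generic greedy set-cover heuristic for discs. It is not one of the paper's algorithms. It assumes, purely for illustration, that a disc is centred on a data point, that the radius constraint is an upper bound `radius`, and that the significance constraint is a minimum number of contained points `min_points`. The function name `greedy_disc_cover` and both parameters are hypothetical, and the brute-force distance computation stands in for the multi-dimensional index structure mentioned in the abstract.

```python
from math import dist


def greedy_disc_cover(points, radius, min_points):
    """Greedy set-cover sketch: choose discs centred on data points.

    Assumed constraints (not taken from the paper): a disc has radius
    `radius` and is significant if it contains at least `min_points` points.
    Returns (centres of chosen discs, points left uncovered).
    """
    n = len(points)
    # Candidate disc i contains every point within `radius` of points[i].
    members = [
        {j for j in range(n) if dist(points[i], points[j]) <= radius}
        for i in range(n)
    ]
    # Keep only candidates that satisfy the assumed significance constraint.
    candidates = {i: m for i, m in enumerate(members) if len(m) >= min_points}

    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered points.
        best = max(
            candidates,
            key=lambda i: len(candidates[i] & uncovered),
            default=None,
        )
        if best is None or not candidates[best] & uncovered:
            break  # the remaining points lie in no significant disc
        chosen.append(best)
        uncovered -= candidates[best]

    return [points[i] for i in chosen], [points[j] for j in sorted(uncovered)]
```

For example, calling `greedy_disc_cover` on two tight groups of points plus one isolated point, with `radius=1.0` and `min_points=2`, selects one disc per group and reports the isolated point as uncovered, since no significant disc contains it. Greedy selection of this kind is the classical logarithmic-factor approximation for set cover in general; whether that bound matches the guarantees proved in the paper is not stated in the abstract.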