TFRP: An efficient microaggregation algorithm for statistical disclosure control

  • Authors:
  • Chin-Chen Chang; Yu-Chiang Li; Wen-Hung Huang

  • Affiliations:
  • Department of Information Engineering and Computer Science, Feng Chia University, 100 Wenhwa Rd., Seatwen, Taichung 40724, Taiwan, ROC and Department of Computer Science and Information Engineerin ...; Department of Computer Science and Information Engineering, National Chung Cheng University, 168, University Rd., San-Hsing, Min-Hsiung, Chiayi 62102, Taiwan, ROC; Institute of Information Systems and Applications, National Tsing Hua University, 101, Section 2, Kuang-Fu Rd., Hsinchu 30013, Taiwan, ROC

  • Venue:
  • Journal of Systems and Software
  • Year:
  • 2007

Abstract

Statistical disclosure control (SDC) has recently attracted much attention. SDC is an important part of data security that deals with protecting databases. Microaggregation is an SDC technique widely used to protect confidentiality in statistical databases released for public use. In microaggregation, similar records are clustered into groups, and each group must contain at least k records to prevent disclosure of individual information, where k is a pre-defined security threshold. For a given k, an optimal multivariate microaggregation is the one with the lowest information loss. Finding the partition with minimum information loss is an NP-hard problem. Existing fixed-size techniques can obtain a low information loss with O(n^2) or O(n^3/k) time complexity. To improve execution time and lower information loss, this study proposes the Two Fixed Reference Points (TFRP) method, a two-phase algorithm for microaggregation. In the first phase, TFRP employs pre-computing and median-of-medians techniques to shorten its running time to O(n^2/k). In the second phase, to decrease information loss, TFRP generates variable-size groups by removing the less homogeneous groups. Experimental results reveal that the proposed method is significantly faster than the Diameter and Centroid methods. On several test datasets, TFRP also significantly reduces information loss, particularly in sparse datasets with a large k.
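
To make the k-record constraint and the notion of information loss concrete, here is a minimal Python sketch of a generic fixed-size microaggregation heuristic. It is not the TFRP algorithm and is not taken from the paper. The function names, the greedy seed rule (the record farthest from the centroid of the remaining records), and the SSE/SST loss measure are all assumptions chosen for illustration; SSE/SST is a measure commonly used in the microaggregation literature.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sse(groups):
    """Sum of squared distances from each record to its own group centroid."""
    total = 0.0
    for group in groups:
        c = centroid(group)
        total += sum(dist(p, c) ** 2 for p in group)
    return total


def information_loss(groups):
    """Assumed loss measure: within-group SSE divided by total SST (0 = no loss)."""
    all_points = [p for group in groups for p in group]
    sst = sse([all_points])
    return sse(groups) / sst if sst > 0 else 0.0


def fixed_size_microaggregate(records, k):
    """Illustrative greedy heuristic (hypothetical, not TFRP).

    Repeatedly pick the record farthest from the centroid of the remaining
    records and group it with its k - 1 nearest neighbours. The final group
    absorbs the leftovers, so every group holds between k and 2k - 1 records.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    remaining = list(records)
    groups = []
    while len(remaining) >= 2 * k:
        c = centroid(remaining)
        seed = max(remaining, key=lambda p: dist(p, c))
        remaining.sort(key=lambda p: dist(p, seed))
        groups.append(remaining[:k])
        remaining = remaining[k:]
    groups.append(remaining)
    return groups
```

Calling `fixed_size_microaggregate(records, k)` on a list of numeric tuples returns a partition in which every group has at least k members, and `information_loss` then scores that partition. Each iteration recomputes a centroid and sorts the remaining records, so this sketch makes no attempt to match the running times quoted in the abstract.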