A modification of the Lloyd algorithm for k-anonymous quantization

  • Authors:
  • David Rebollo-Monedero;Jordi Forné;Esteve PallarèS;Javier Parra-Arnau

  • Affiliations:
  • Dept. of Telematics Engineering, Universitat Politécnica de Catalunya, C. Jordi Girona 1-3, E-08034 Barcelona, Spain;Dept. of Telematics Engineering, Universitat Politécnica de Catalunya, C. Jordi Girona 1-3, E-08034 Barcelona, Spain;Dept. of Telematics Engineering, Universitat Politécnica de Catalunya, C. Jordi Girona 1-3, E-08034 Barcelona, Spain;Dept. of Telematics Engineering, Universitat Politécnica de Catalunya, C. Jordi Girona 1-3, E-08034 Barcelona, Spain

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2013

Quantified Score

Hi-index 0.07

Visualization

Abstract

We address the problem of designing quantizers that cluster data while satisfying a k-anonymity requirement. A general data compression perspective is adopted, which considers both discrete and continuous probability distributions, and corresponding constraints on both cell sizes and quantizer index probabilities. Potential applications of this problem extend well beyond the important case of microdata anonymization, to include also optimized task allocation under workload constraints. Our contribution is twofold. First and most importantly, we present a theoretical analysis showing the optimality conditions which probability-constrained quantizers must satisfy, thereby theoretically characterizing optimal k-anonymous aggregation as a special case. As a second contribution, inspired by our theoretical analysis, we propose an alternating optimization algorithm for the design of this type of quantizers. Our algorithm is conceptually motivated by the popular Lloyd-Max algorithm for quantization design, originally intended for data compression, also known as the k-means method. Experimental results for synthetic and real data, with mean squared error as a distortion measure, confirm that our method outperforms MDAV, a popular fixed-size microaggregation algorithm for statistical disclosure control. This performance improvement is in terms of data utility, for the exact same k-anonymity constraint, but does come at the expense of higher computational sophistication.