Kd-trees and the real disclosure risks of large statistical databases

Authors:
Javier Herranz;Jordi Nin;Marc Solé
Affiliations:
Dept. of Matemàtica Aplicada IV, UPC, Barcelona, Spain;Dept. of Arquitectura de Computadors, UPC, Barcelona, Spain;Dept. of Arquitectura de Computadors, UPC, Barcelona, Spain
Venue:
Information Fusion
Year:
2012

Citing 16
Cited 1

Security-control methods for statistical databases: a comparative study

ACM Computing Surveys (CSUR)
An optimal algorithm for approximate nearest neighbor searching in fixed dimensions

Journal of the ACM (JACM)
Multidimensional binary search trees used for associative searching

Communications of the ACM
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Data Mining and Knowledge Discovery
Practical Data-Oriented Microaggregation for Statistical Disclosure Control

IEEE Transactions on Knowledge and Data Engineering
Microdata Protection through Noise Addition

Inference Control in Statistical Databases, From Theory to Practice
Disclosure risk assessment in statistical microdata protection via advanced record linkage

Statistics and Computing
Probabilistic Information Loss Measures in Confidentiality Protection of Continuous Microdata

Data Mining and Knowledge Discovery
Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining

IEEE Transactions on Knowledge and Data Engineering
ℓ-Diversity: Privacy Beyond k-Anonymity

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Privacy Protection: p-Sensitive k-Anonymity Property

ICDEW '06 Proceedings of the 22nd International Conference on Data Engineering Workshops
Efficient multivariate data-oriented microaggregation

The VLDB Journal — The International Journal on Very Large Data Bases
Gorder: an efficient method for KNN join processing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Rethinking rank swapping to decrease disclosure risk

Data & Knowledge Engineering
From t-Closeness-Like Privacy to Postrandomization via Information Theory

IEEE Transactions on Knowledge and Data Engineering
Optimal Symbol Alignment Distance: A New Distance for Sequences of Symbols

IEEE Transactions on Knowledge and Data Engineering

Information fusion in data privacy: A survey

Information Fusion

Abstract

Estimating the disclosure risk of a Statistical Disclosure Control (SDC) protection method by means of (distance-based) record linkage techniques is a very popular approach to analyzing the privacy level offered by such a method. When databases are very large, record linkage techniques such as blocking or partitioning are usually applied to make this process reasonably efficient. However, in this case the record linkage process is not exact, which means that the disclosure risk of an SDC protection method may be underestimated. In this paper we propose the use of kd-tree techniques to apply exact yet very efficient record linkage when (protected) datasets are very large. We describe some experiments showing that this approach achieves better results, in terms of both accuracy and running time, than more classical approaches such as record linkage based on a sliding window. We also discuss and experiment on the use of these techniques not to link a whole protected record with its original one, but just to guess the value of some confidential attribute(s) of the record(s). This leads to concepts such as k-neighbor l-diversity or k-neighbor p-sensitivity, generalizations (to any SDC protection method) of l-diversity and p-sensitivity, which have been defined for SDC protection methods ensuring k-anonymity, such as microaggregation.
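
As an illustration of the idea in the abstract, the following is a minimal sketch of exact distance-based record linkage using a kd-tree. It is not the authors' implementation. The function names (`build_kdtree`, `nearest`, `linkage_rate`, `brute_force_rate`), the use of squared Euclidean distance, and the synthetic data protected by simple noise addition are all assumptions made for the example. A protected record counts as re-identified when its nearest original record is the one it was derived from.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Build a kd-tree over (index, vector) pairs.

    Each node is (point, axis, left_subtree, right_subtree); empty subtrees are None.
    """
    if not points:
        return None
    axis = depth % len(points[0][1])          # cycle through the attributes
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2                    # median split keeps the tree balanced
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Exact nearest-neighbour search; returns (squared distance, record index)."""
    if node is None:
        return best
    (idx, vec), axis, left, right = node
    cand = (sq_dist(vec, query), idx)
    if best is None or cand < best:
        best = cand
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is within the best distance.
    if diff * diff <= best[0]:
        best = nearest(far, query, best)
    return best

def linkage_rate(original, protected):
    """Fraction of protected records whose nearest original record is the true one."""
    tree = build_kdtree(list(enumerate(original)))
    hits = sum(1 for i, rec in enumerate(protected) if nearest(tree, rec)[1] == i)
    return hits / len(protected)

def brute_force_rate(original, protected):
    """Same measure computed by comparing every pair of records (quadratic cost)."""
    hits = 0
    for i, rec in enumerate(protected):
        j = min(range(len(original)), key=lambda j: (sq_dist(original[j], rec), j))
        hits += (j == i)
    return hits / len(protected)

# Hypothetical data: 500 records with 3 numeric attributes, protected by
# adding Gaussian noise (an assumed, deliberately simple protection method).
random.seed(0)
original = [[random.random() for _ in range(3)] for _ in range(500)]
protected = [[x + random.gauss(0, 0.05) for x in rec] for rec in original]

print(linkage_rate(original, protected))
```

Because the kd-tree search prunes only subtrees that cannot contain a closer record, it returns the same links as the exhaustive comparison, whereas blocking or a sliding window may miss the true nearest record and so report a lower linkage rate.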