Achieving anonymity via clustering

Authors:
Gagan Aggarwal;Rina Panigrahy;Tomás Feder;Dilys Thomas;Krishnaram Kenthapadi;Samir Khuller;An Zhu
Affiliations:
Google Inc., Mountain View, CA;Microsoft Research, Mountain View, CA;Stanford University, Stanford, CA;Oracle, Redwood Shores, CA;Microsoft Research, Mountain View, CA;University of Maryland, College Park, MD;Google Inc., Mountain View, CA
Venue:
ACM Transactions on Algorithms (TALG)
Year:
2010

Abstract

Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance. The traditional approach to de-identifying records is to remove identifying fields such as social security number and name. However, recent research has shown that a large fraction of the U.S. population can be identified using nonkey attributes (called quasi-identifiers) such as date of birth, gender, and zip code. The k-anonymity model protects privacy by requiring that nonkey attributes that leak information be suppressed or generalized so that, for every record in the modified table, there are at least k−1 other records having exactly the same values for the quasi-identifiers. We propose a new method for anonymizing data records, in which the quasi-identifiers of data records are first clustered and then the cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a prespecified number of data records. This technique is more general than k-anonymity, since it allows a much larger choice of cluster centers. In many cases, it lets us release considerably more information without compromising privacy. We also provide constant-factor approximation algorithms for computing such a clustering. This is the first set of algorithms for the anonymization problem whose performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an ε fraction of points to remain unclustered, that is, to be deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful.
Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.
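
The publishing scheme described in the abstract (cluster the quasi-identifiers, require a minimum cluster size, optionally drop an ε fraction of outliers, release only the cluster centers) can be illustrated with a minimal Python sketch. This is a naive greedy heuristic written for illustration only; it is not the paper's constant-factor approximation algorithm and carries no approximation guarantee. The function names, the use of Euclidean distance, the choice of the centroid as the published center, and the outlier rule (drop the records whose (r−1)-th nearest neighbor is farthest away) are all assumptions not taken from the source.

```python
import math

def centroid(cluster, points):
    """Coordinate-wise mean of the points whose indices are in `cluster`."""
    dims = len(points[0])
    return tuple(sum(points[i][d] for i in cluster) / len(cluster)
                 for d in range(dims))

def kth_neighbor_distance(i, points, k):
    """Distance from points[i] to its k-th nearest other point (k >= 1)."""
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def anonymize_by_clustering(points, r, eps=0.0):
    """Hypothetical greedy sketch: every published cluster has >= r records.

    Returns (published, dropped), where `published` is a list of
    (center, size, radius) tuples and `dropped` lists the suppressed records.
    """
    n = len(points)
    n_drop = int(eps * n)  # at most an eps fraction may stay unclustered
    if not 2 <= r <= n - n_drop:
        raise ValueError("need 2 <= r <= number of records kept")

    # Assumed outlier rule: records in sparse regions are dropped first.
    by_density = sorted(range(n),
                        key=lambda i: kth_neighbor_distance(i, points, r - 1))
    remaining = by_density[: n - n_drop]
    dropped = [points[i] for i in by_density[n - n_drop:]]

    # Greedy grouping: a seed plus its r-1 nearest unassigned records.
    clusters = []
    while len(remaining) >= r:
        seed = remaining[0]
        remaining.sort(key=lambda i: math.dist(points[seed], points[i]))
        clusters.append(remaining[:r])
        remaining = remaining[r:]

    # Fewer than r leftovers: attach each to the nearest existing cluster.
    for i in remaining:
        nearest = min(clusters,
                      key=lambda c: math.dist(points[i], centroid(c, points)))
        nearest.append(i)

    published = []
    for c in clusters:
        center = centroid(c, points)
        radius = max(math.dist(center, points[i]) for i in c)
        published.append((center, len(c), radius))
    return published, dropped

# Usage: two tight groups of quasi-identifier vectors and one far outlier.
records = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (100, 100)]
published, dropped = anonymize_by_clustering(records, r=3, eps=0.15)
for center, size, radius in published:
    print(center, size, round(radius, 3))
print("suppressed:", dropped)
```

In this toy input, suppressing the single far-away record keeps every published cluster tight, whereas running with `eps=0.0` forces that record into a cluster and inflates its radius. That is the distortion effect the abstract attributes to outliers.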