When random sampling preserves privacy

Authors:
Kamalika Chaudhuri; Nina Mishra
Affiliations:
Computer Science Department, UC Berkeley, Berkeley, CA; Computer Science Department, University of Virginia, Charlottesville, VA
Venue:
CRYPTO'06 Proceedings of the 26th annual international conference on Advances in Cryptology
Year:
2006

Abstract

Many organizations, such as the U.S. Census, publicly release samples of the data they collect about private citizens. These datasets are first anonymized using various techniques, and then a small sample is released to enable “do-it-yourself” calculations. This paper investigates the privacy of the second step of this process: sampling. We observe that rare values, that is, values that occur with low frequency in the table, can be problematic from a privacy perspective. To our knowledge, this is the first work that quantitatively examines the relationship between the number of rare values in a table and the privacy of a released random sample. If we require $\epsilon$-privacy (where the larger $\epsilon$ is, the worse the privacy guarantee) with probability at least $1 - \delta$, we say that a value is rare if it occurs in at most $\tilde{O}(\frac{1}{\epsilon})$ rows of the table (ignoring log factors). If there are no rare values, then we establish a direct connection between the sample size that is safe to release and privacy. Specifically, if we select each row of the table with probability at most $\epsilon$, then the sample is $O(\epsilon)$-private with high probability. If there are $t$ rare values, then the sample is $\tilde{O}(\epsilon \delta / t)$-private with probability at least $1 - \delta$.
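
The two ingredients named in the abstract, selecting each row independently with a fixed probability and checking a table for rare values, can be illustrated with a minimal Python sketch. This is not the paper's algorithm or analysis. The function names `bernoulli_sample` and `rare_values`, the toy table, and the cutoff `int(1 / epsilon)` are assumptions made for illustration only; in particular, the cutoff drops the log factors hidden in the $\tilde{O}(\frac{1}{\epsilon})$ bound, so it only approximates the abstract's notion of "rare".

```python
import random
from collections import Counter


def bernoulli_sample(rows, p, rng=None):
    """Keep each row independently with probability p (hypothetical helper)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = rng or random.Random()
    # rng.random() is uniform on [0, 1), so p = 0 keeps nothing and p = 1 keeps everything.
    return [row for row in rows if rng.random() < p]


def rare_values(rows, max_count):
    """Return the set of values that occur in at most max_count rows."""
    counts = Counter(rows)
    return {value for value, n in counts.items() if n <= max_count}


# Toy single-column table (made-up data): two common values and one rare value.
table = ["A"] * 500 + ["B"] * 480 + ["C"] * 3

epsilon = 0.1
# Assumption: a simplified rarity cutoff of 1/epsilon that ignores log factors.
threshold = int(1 / epsilon)

rare = rare_values(table, threshold)
sample = bernoulli_sample(table, epsilon, random.Random(0))

print("rare values:", rare)
print("sample size:", len(sample), "of", len(table))
```

In this toy table the value "C" appears in only 3 rows, below the simplified cutoff of 10, so it is flagged as rare. Under the abstract's result, a table containing such values would receive the weaker guarantee that depends on $t$ rather than the $O(\epsilon)$ guarantee.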