Many organizations require detailed individual-level information, much of which has been collected under guarantees of confidentiality. However, simple anonymization procedures, such as removing names and addresses, are insufficient to ensure that confidentiality. The records of certain individuals have a high probability of being identified because their contents, or attributes, are unusual, and they therefore have the potential to be recognized spontaneously; such records are referred to as special uniques. Consider, for example, a sixteen-year-old widow in a population survey. The confidentiality of a given dataset cannot be ensured until all special unique records are identified and either disguised or removed. However, to the knowledge of the authors, no exhaustive automated analysis of this nature has been conducted, owing to the demanding levels of computation and data storage required. This paper introduces a new algorithm that locates 'risky' records in discrete data by first identifying all unique attribute sets (up to a user-specified maximum size) and then grading the 'risk' of each record according to the number and distribution of unique attribute sets within it. Empirical tests indicate that the algorithm is highly effective at picking out 'risky' records from large samples of data.
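The two steps described in the abstract can be illustrated with a small brute-force sketch. This is not the paper's algorithm, which is designed to cope with the computational and storage demands that a naive search like this one ignores. The function names, the choice to keep only minimal unique sets, and the scoring formula are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def unique_attribute_sets(records, max_size):
    """Map each record index to the column subsets on which it is unique.

    Brute force: every column subset of up to max_size columns is checked.
    Only minimal subsets are kept (an assumption of this sketch): a subset
    is skipped when one of its proper subsets already isolates the record.
    """
    n_cols = len(records[0])
    found = {i: [] for i in range(len(records))}
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_cols), size):
            keys = [tuple(rec[c] for c in cols) for rec in records]
            counts = Counter(keys)
            for i, key in enumerate(keys):
                if counts[key] != 1:
                    continue  # value combination is shared with another record
                if any(set(prev) < set(cols) for prev in found[i]):
                    continue  # a smaller unique set already covers this one
                found[i].append(cols)
    return found


def risk_score(unique_sets, n_cols):
    """Hypothetical grading: each unique set counts, smaller sets count more."""
    return sum(2 ** (n_cols - len(cols)) for cols in unique_sets)


# Toy sample with columns (age band, marital status).
records = [
    ("16-19", "widowed"),
    ("16-19", "single"),
    ("16-19", "single"),
    ("60+", "widowed"),
    ("60+", "widowed"),
    ("60+", "married"),
    ("60+", "married"),
]
found = unique_attribute_sets(records, max_size=2)
scores = {i: risk_score(sets, n_cols=2) for i, sets in found.items()}
```

In this toy sample no single attribute value is unique, but the combination of age band "16-19" and marital status "widowed" occurs only once, so record 0 is the only one with a nonzero score. The cost of the exhaustive search grows with the number of column subsets, which is why a practical method needs something more efficient than this sketch.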