A computational algorithm for handling the special uniques problem

Authors:
M. J. Elliot; A. M. Manning; R. W. Ford
Affiliations:
Cathie Marsh Center for Census and Survey Research (CCSR), Manchester University, M13 9PL, UK; Department of Computer Science, Manchester University, M13 9PL, UK; Department of Computer Science, Manchester University, M13 9PL, UK
Venue:
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
Year:
2002



Abstract

Many organizations require detailed individual-level information, much of which has been collected under guarantees of confidentiality. However, simple anonymization procedures, such as removing names and addresses, are insufficient to ensure that confidentiality. Some records have a high probability of being identified because their contents (attributes) are unusual, and so may be recognized spontaneously; such records are referred to as special uniques. Consider, for example, a sixteen-year-old widow in a population survey. The confidentiality of a dataset cannot be assured until all special unique records have been identified and either disguised or removed. However, to the knowledge of the authors, no exhaustive automated analysis of this nature has been conducted, owing to the demanding levels of computation and data storage required. This paper introduces a new algorithm that locates 'risky' records in discrete data in two steps: first, it identifies all unique attribute sets (up to a user-specified maximum size); second, it grades the 'risk' of each record by considering the number and distribution of unique attribute sets within that record. Empirical tests indicate that the algorithm is highly effective at picking out 'risky' records from large samples of data.
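
The two steps described in the abstract can be illustrated with a minimal brute-force sketch. This is not the authors' algorithm, whose details are not given here. The function names, the toy records, and the 1/size scoring rule are all assumptions made for illustration. The sketch checks every combination of columns up to a maximum size, records where each row's values occur exactly once, and then grades each row with a hypothetical score that counts smaller unique sets more heavily.

```python
from collections import Counter
from itertools import combinations


def unique_attribute_sets(records, max_size):
    """For each record index, list the column subsets (up to max_size columns)
    on which that record's combination of values occurs exactly once."""
    n_cols = len(records[0])
    found = {i: [] for i in range(len(records))}
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_cols), size):
            # Count how often each value combination occurs on these columns.
            counts = Counter(tuple(r[c] for c in cols) for r in records)
            for i, r in enumerate(records):
                if counts[tuple(r[c] for c in cols)] == 1:
                    found[i].append(cols)
    return found


def risk_scores(found):
    """Hypothetical grading rule (an assumption, not taken from the paper):
    each unique attribute set contributes 1/size to the record's score."""
    return {i: sum(1.0 / len(cols) for cols in sets) for i, sets in found.items()}


# Toy data, invented for illustration: (age band, marital status, sex).
records = [
    ("16-20", "widowed", "F"),
    ("16-20", "single", "F"),
    ("16-20", "single", "F"),
    ("16-20", "single", "M"),
    ("16-20", "single", "M"),
    ("61+", "widowed", "F"),
    ("61+", "widowed", "F"),
    ("61+", "married", "M"),
    ("61+", "married", "M"),
]

found = unique_attribute_sets(records, max_size=2)
scores = risk_scores(found)
riskiest = max(scores, key=scores.get)
print(riskiest, found[riskiest], scores[riskiest])
```

In this toy data only the first record, the young widow, is unique on any attribute set of size two or less, namely the pair (age band, marital status). The sketch enumerates every column subset, so its cost grows combinatorially with the number of columns and the maximum set size, which matches the computational demands the abstract mentions. It also makes no attempt to skip supersets of an attribute set that is already unique, although those are necessarily unique as well.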