Re-identifying register data by survey data using cluster analysis: an empirical study

Authors:
Johann Bacher;Ruth Brand;Stefan Bender
Affiliations:
Chair of Sociology, University of Erlangen-Nuremberg, Findelgasse 7-9, D-90401 Nuernberg, Germany;Statistisches Bundesamt, Gustav Stresemann Ring 11, D-65189 Wiesbaden, Germany;Institute for Employment Research, Regensburger Str. 104, D-90327 Nuernberg, Germany
Venue:
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
Year:
2002

Abstract

More and more empirical researchers from universities and research centres want to use register or survey data collected by statistical agencies or the social security system, since these data can be used for a range of empirical studies, e.g. the analysis of special groups or of the quantitative effects of economic or social policies. Most of the data required have to be (factually) anonymised before they are disseminated, in order to preserve confidentiality. In the area of statistics on households and individuals this path has been pursued in Germany for several years. The transmission of de facto anonymised data files has proved to be a good form of co-operation between scientists and statisticians.

Factual anonymity of the data depends on the costs and benefits of a potential re-identification. The paper assumes that the intruder accepts only low costs and therefore uses a cluster analysis module available in a standard statistical software package to re-identify persons. After a description of the method, different factors influencing the re-identification risk are studied using German employment statistics (register data) and the German Life History Study (survey data). The factors are the sampling fraction and the number of (irrelevant) variables. The results show that the number of identifiable persons is remarkably high. Furthermore, the cluster analysis confirms that the number of re-identifiable records increases with the sampling fraction and that irrelevant variables reduce this number.
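The general idea of matching two files with a clustering routine can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not specify the software module or its settings. It assumes that the survey records serve as fixed cluster centres, that each register record is assigned to its nearest centre on standardised variables, and that a centre with exactly one assigned register record counts as a candidate re-identification. The function names, the two variables (age, income) and the toy records are all hypothetical.

```python
from math import sqrt

def column_stats(rows):
    """Mean and standard deviation of each column (sd of 0 is replaced by 1)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    sds = []
    for col, m in zip(zip(*rows), means):
        var = sum((x - m) ** 2 for x in col) / n
        sds.append(sqrt(var) or 1.0)
    return means, sds

def standardize(rows, means, sds):
    """Z-standardise every record so that variables are on a common scale."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def nearest_centre(record, centres):
    """Index of the centre with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(record, c)) for c in centres]
    return min(range(len(centres)), key=dists.__getitem__)

def candidate_matches(survey, register):
    """Map survey index -> register index for centres with exactly one member.

    Assumption: survey records are the cluster centres and are not updated.
    """
    means, sds = column_stats(register + survey)
    centres = standardize(survey, means, sds)
    members = {i: [] for i in range(len(survey))}
    for j, rec in enumerate(standardize(register, means, sds)):
        members[nearest_centre(rec, centres)].append(j)
    return {i: m[0] for i, m in members.items() if len(m) == 1}

# Invented toy records (age, monthly income); not taken from the study.
register = [[25, 1800], [26, 1850], [40, 3000], [58, 4200]]
survey = [[41, 2950], [57, 4300]]

matches = candidate_matches(survey, register)
print(matches)  # {1: 3}
```

In this toy example only the second survey record receives a unique candidate: the first centre attracts three register records, including two that have no counterpart in the survey, so no unique match can be claimed for it.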