Re-identifying register data by survey data using cluster analysis: an empirical study

  • Authors:
  • Johann Bacher;Ruth Brand;Stefan Bender

  • Affiliations:
  • Chair of Sociology, University of Erlangen-Nuremberg, Findelgasse 7-9, D-90401 Nuernberg, Germany;Statistisches Bundesamt, Gustav Stresemann Ring 11, D-65189 Wiesbaden, Germany;Institute for Employment Research, Regensburger Str. 104, D-90327 Nuernberg, Germany

  • Venue:
  • International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
  • Year:
  • 2002

Abstract

A growing number of empirical researchers at universities and research centres want to use register or survey data collected by statistical agencies or the social security system, since these data can be used for a range of empirical studies, e.g. the analysis of special groups or of the quantitative effects of economic or social policies. Most of the required data have to be (factually) anonymised before they are disseminated in order to preserve confidentiality. In the area of statistics on households and individuals, this path has been pursued in Germany for several years. The transmission of de facto anonymised data files has proved to be a good form of co-operation between scientists and statisticians.

Factual anonymity of the data depends on the costs and benefits of a potential re-identification. The paper assumes that the intruder accepts only low costs and therefore uses a cluster analysis module available in a standard statistical software package to re-identify persons. After a description of the method, different factors influencing the re-identification risk are studied using German employment statistics (register data) and the German Life History Study (survey data). The factors are the sampling fraction and the number of (irrelevant) variables. The results show that the number of identifiable persons is remarkably high. Furthermore, the cluster analysis confirms that the number of re-identifiable records increases with increasing sampling fraction and that irrelevant variables reduce this number.
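The general idea of a low-cost, cluster-based matching attack can be illustrated with a minimal sketch. It is not the procedure used in the paper, whose details the abstract does not give. The sketch assumes that both files share a set of numeric key variables, that the register records serve as fixed cluster centres, and that each survey record is assigned to its nearest centre (the assignment step of a k-means-style procedure). A cluster containing exactly one survey record is treated as a candidate re-identification. The function name `reidentify` and the synthetic data are hypothetical.

```python
import numpy as np

def reidentify(register, survey):
    """Assign each survey record to its nearest register record.

    Assumption: register records act as fixed cluster centres.
    Returns a dict {survey index: register index} for every survey
    record that is the only member of its cluster.
    """
    # Standardise both files with the register's mean and std so that
    # no single variable dominates the Euclidean distance.
    mean = register.mean(axis=0)
    std = register.std(axis=0)
    std[std == 0] = 1.0
    reg = (register - mean) / std
    sur = (survey - mean) / std

    # Squared Euclidean distances, shape (n_survey, n_register).
    dist = ((sur[:, None, :] - reg[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)  # cluster label per survey record

    # Number of survey records that fall into each register cluster.
    counts = np.bincount(nearest, minlength=len(reg))
    return {s: int(r) for s, r in enumerate(nearest) if counts[r] == 1}

# Synthetic illustration (made-up data, not the study's files).
rng = np.random.default_rng(0)
register = rng.normal(size=(200, 4))                  # 200 records, 4 key variables
true_ids = rng.choice(200, size=20, replace=False)    # persons also in the survey
survey = register[true_ids] + rng.normal(scale=0.05, size=(20, 4))  # measurement noise

matches = reidentify(register, survey)
correct = sum(int(true_ids[s] == r) for s, r in matches.items())
print(len(matches), "candidate matches,", correct, "of them correct")
```

In such a sketch, the factors named in the abstract correspond to the size of `survey` relative to `register` (sampling fraction) and to appending noise columns that carry no identifying information (irrelevant variables).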