Generating useful test data for complex linked employer-employee datasets

Authors:
Matthias Dorner; Jörg Drechsler; Peter Jacobebbinghaus
Affiliations:
Institute for Employment Research, Nuremberg, Germany; Institute for Employment Research, Nuremberg, Germany; Institute for Employment Research, Nuremberg, Germany
Venue:
PSD'12 Proceedings of the 2012 international conference on Privacy in Statistical Databases
Year:
2012

Abstract

When data access for external researchers is difficult or time-consuming, it can be beneficial to disseminate test datasets that mimic the structure of the original data in advance. With these test data, researchers can develop their analysis code or decide whether the data are suitable for their planned research before they go through the lengthy process of getting access at the research data center. The aim of these data is not to provide any meaningful results. Instead, it is important to maintain the structure of the data as closely as possible, including skip patterns, logical constraints between the variables, and longitudinal relationships, so that any code developed using these test data will also run on the original data without further modification. Achieving this goal can be challenging for complex datasets such as linked employer-employee datasets (LEED), where the links between the establishments and the employees also need to be maintained. Using the LEED of the Institute for Employment Research, we illustrate how useful test data can be developed for such complex datasets. Our approach mainly relies on traditional statistical disclosure control (SDC) techniques, such as data swapping and noise addition, for data protection. Since statistical inferences need not be preserved, high swapping rates can be applied to sufficiently protect the data. At the same time, it is straightforward to maintain the structure of the data by adding some constraints on the swapping procedure.
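
The idea of swapping under constraints, followed by noise addition, can be illustrated with a minimal Python sketch. It is not the authors' procedure, which the abstract does not specify in detail. The record layout (`person_id`, `estab_id`, `employed`, `wage`), the function names, and the chosen rates are all hypothetical. The sketch assumes one skip pattern (`wage` is missing exactly when `employed` is false) and leaves the establishment identifier untouched so that employer-employee links survive.

```python
import random
from collections import defaultdict


def constrained_swap(records, swap_var, stratum_vars, rate, rng):
    """Swap values of swap_var among records that share the same stratum.

    Restricting swaps to a stratum preserves every constraint that depends
    only on the stratum variables (here: the assumed skip pattern).
    """
    out = [dict(r) for r in records]  # copy so the input stays unchanged
    strata = defaultdict(list)
    for i, rec in enumerate(out):
        strata[tuple(rec[v] for v in stratum_vars)].append(i)
    for indices in strata.values():
        # Select each record for swapping with probability `rate`.
        chosen = [i for i in indices if rng.random() < rate]
        values = [out[i][swap_var] for i in chosen]
        rng.shuffle(values)  # permute values among the selected records
        for i, value in zip(chosen, values):
            out[i][swap_var] = value
    return out


def add_noise(records, var, rel_sd, rng):
    """Apply multiplicative Gaussian noise to non-missing values of var."""
    out = [dict(r) for r in records]
    for rec in out:
        if rec[var] is not None:
            noisy = rec[var] * (1.0 + rng.gauss(0.0, rel_sd))
            rec[var] = max(0.0, noisy)  # assumed constraint: wage >= 0
    return out


# Hypothetical toy data: two establishments, five persons.
records = [
    {"person_id": 1, "estab_id": "A", "employed": True, "wage": 2100.0},
    {"person_id": 2, "estab_id": "A", "employed": True, "wage": 3400.0},
    {"person_id": 3, "estab_id": "A", "employed": False, "wage": None},
    {"person_id": 4, "estab_id": "B", "employed": True, "wage": 2800.0},
    {"person_id": 5, "estab_id": "B", "employed": False, "wage": None},
]

rng = random.Random(0)
# A high swapping rate is acceptable because inferences need not be preserved.
swapped = constrained_swap(records, "wage", ["employed"], rate=0.9, rng=rng)
test_data = add_noise(swapped, "wage", rel_sd=0.1, rng=rng)
```

Because swaps only occur between records with the same value of `employed`, the skip pattern holds in `test_data` by construction, and analysis code written against it should encounter the same missingness structure as in the original records.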