Using support vector machines for generating synthetic datasets

Authors:
Jörg Drechsler
Affiliations:
Institute for Employment Research, Nuremberg, Germany
Venue:
PSD'10 Proceedings of the 2010 international conference on Privacy in statistical databases
Year:
2010

Abstract

Generating synthetic datasets is an innovative approach to data dissemination: values at risk of disclosure, or even the entire dataset, are replaced with multiple draws from statistical models. The quality of the released data depends strongly on how well these models capture important relationships in the original data. Defining useful models for complex survey data can be difficult and cumbersome. One way to reduce the modeling burden for data-disseminating agencies is to rely on machine learning tools to reveal important relationships in the data. This paper presents an initial investigation of whether support vector machines can be used to generate synthetic datasets. The application is limited to categorical data, but extensions to continuous data should be straightforward. I briefly describe the concept of support vector machines and the adjustments necessary for synthetic data generation. I evaluate the performance of the suggested algorithm on a real dataset, the IAB Establishment Panel. The results indicate that some improvements in data utility might be achievable using support vector machines. However, these improvements come at the price of an increased disclosure risk compared to standard parametric modeling, and more research is needed to find ways of reducing this risk. Some ideas for achieving this goal are provided in the discussion at the end of the paper.
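The replacement step described in the abstract (sensitive categorical values swapped for multiple draws from a fitted model) can be illustrated with a minimal sketch. This is not the paper's algorithm: a simple table of empirical conditional frequencies stands in for the class probabilities that a support vector machine would supply, and the variable names (`size`, `sector`, `union`), the function names, and the number of copies `m` are all hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_conditional_frequencies(records, target, predictors):
    """Stand-in model (NOT an SVM): count target values per predictor pattern."""
    counts = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[p] for p in predictors)
        counts[key][rec[target]] += 1
    return counts


def synthesize(records, target, predictors, m=5, seed=0):
    """Return m partially synthetic copies; only `target` is replaced by draws."""
    rng = random.Random(seed)
    counts = fit_conditional_frequencies(records, target, predictors)
    copies = []
    for _ in range(m):
        copy = []
        for rec in records:
            key = tuple(rec[p] for p in predictors)
            labels = list(counts[key].keys())
            weights = list(counts[key].values())
            new_rec = dict(rec)  # predictors stay at their original values
            new_rec[target] = rng.choices(labels, weights=weights, k=1)[0]
            copy.append(new_rec)
        copies.append(copy)
    return copies


# Toy records with hypothetical variable names (not the IAB Establishment Panel).
original = [
    {"size": "small", "sector": "retail", "union": "no"},
    {"size": "small", "sector": "retail", "union": "yes"},
    {"size": "small", "sector": "retail", "union": "no"},
    {"size": "large", "sector": "industry", "union": "yes"},
    {"size": "large", "sector": "industry", "union": "yes"},
    {"size": "large", "sector": "industry", "union": "no"},
]

synthetic = synthesize(original, target="union", predictors=["size", "sector"], m=3)
```

Each element of `synthetic` is one released copy of the data. Swapping the frequency table for a classifier that returns class probabilities would leave the drawing step unchanged, which is the sense in which a machine learning tool could take over the modeling burden.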