A Bayes-true data generator for evaluation of supervised and unsupervised learning methods

Authors:
Janick V. Frasch;Aleksander Lodwich;Faisal Shafait;Thomas M. Breuel
Affiliations:
German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany and University of Kaiserslautern, 67663 Kaiserslautern, Germany;German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany;German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany;University of Kaiserslautern, 67663 Kaiserslautern, Germany
Venue:
Pattern Recognition Letters
Year:
2011

Abstract

Benchmarking of pattern recognition, machine learning, and data mining methods commonly relies on real-world data sets. However, using real-world data has disadvantages. On the one hand, collecting real-world data can be difficult or impossible for various reasons. On the other hand, real-world variables are hard to control, even in the problem domain; in the feature domain, where most statistical learning methods operate, exercising control is even more difficult and hence rarely attempted. This is at odds with scientific experimentation guidelines, which mandate the use of variables that are as directly controllable and as directly observable as possible. In these respects, synthetic data has advantages over real-world data sets. In this paper we propose a method that produces synthetic data with guaranteed global and class-specific statistical properties. The method is based on overlapping class densities placed on the corners of a regular k-simplex. The generator can be used for algorithm testing and fair performance evaluation of statistical learning methods. Because of the generator's strong properties, researchers can reproduce each other's experiments by knowing the parameters used, instead of transmitting large data sets.
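
The idea of placing class densities on the corners of a regular k-simplex can be illustrated with a minimal sketch. This is not the authors' generator: the abstract does not specify the density family or the parameterization, so the isotropic Gaussian classes and the names `simplex_vertices`, `generate`, `edge`, `sigma`, and `seed` are all assumptions made for illustration. The sketch also makes no attempt to reproduce the guaranteed statistical properties the paper claims.

```python
import math
import random


def simplex_vertices(k, edge=1.0):
    """Return the k+1 corners of a regular k-simplex in R^k, centered at the origin."""
    # The k unit vectors are pairwise sqrt(2) apart; the point c*(1, ..., 1)
    # with c = (1 + sqrt(k + 1)) / k is also sqrt(2) away from each of them.
    c = (1.0 + math.sqrt(k + 1)) / k
    verts = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    verts.append([c] * k)
    # Center on the centroid and rescale so every edge has the requested length.
    centroid = [sum(v[j] for v in verts) / (k + 1) for j in range(k)]
    scale = edge / math.sqrt(2.0)
    return [[(v[j] - centroid[j]) * scale for j in range(k)] for v in verts]


def generate(n_per_class, k, edge, sigma, seed):
    """Sample k+1 classes, one per simplex corner (hypothetical parameterization)."""
    rng = random.Random(seed)  # a fixed seed makes the data set reproducible
    samples, labels = [], []
    for label, corner in enumerate(simplex_vertices(k, edge)):
        for _ in range(n_per_class):
            # Assumption: isotropic Gaussian class density centered on the corner.
            samples.append([x + rng.gauss(0.0, sigma) for x in corner])
            labels.append(label)
    return samples, labels


if __name__ == "__main__":
    X, y = generate(n_per_class=100, k=3, edge=2.0, sigma=0.5, seed=42)
    print(len(X), "samples in", len(set(y)), "classes")
```

In this sketch, the ratio of `sigma` to `edge` controls how strongly the classes overlap, and the tuple `(n_per_class, k, edge, sigma, seed)` is enough to regenerate the same data set, which mirrors the abstract's point about exchanging parameters rather than data.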