Generation of synthetic data sets for evaluating the accuracy of knowledge discovery systems

Authors:
Daniel R. Jeske; Behrokh Samadi; Pengyue J. Lin; Lan Ye; Sean Cox; Rui Xiao; Ted Younglove; Minh Ly; Douglas Holt; Ryan Rich
Affiliations:
University of California, Riverside, CA; Lucent Technologies, Holmdel, NJ; University of California, Riverside, CA; University of California, Riverside, CA; University of California, Riverside, CA; University of California, Riverside, CA; University of California, Riverside, CA; University of California, Riverside, CA; University of California, Riverside, CA; University of California, Riverside, CA
Venue:
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Year:
2005

Citing 4
Cited 6

Distributed Data Mining in Credit Card Fraud Detection
IEEE Intelligent Systems

Using ethnography to design a mass detection tool (MDT) for the early discovery of insurance fraud
CHI '03 Extended Abstracts on Human Factors in Computing Systems

Generation of Synthetic Training Data for an HMM-based Handwriting Recognition System
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1

A graph model for unsupervised lexical acquisition
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1

Beyond Homemade Artificial Data Sets
HAIS '09 Proceedings of the 4th International Conference on Hybrid Artificial Intelligence Systems

Synthetic data generation capabilities for testing data mining tools
MILCOM'06 Proceedings of the 2006 IEEE conference on Military communications

3LSPG: forensic tool evaluation by three layer stochastic process-based generation of data
IWCF'10 Proceedings of the 4th international conference on Computational forensics

A Bayes-true data generator for evaluation of supervised and unsupervised learning methods
Pattern Recognition Letters

Generation of training database using a noise model for OCR systems

Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs
ACM Transactions on Knowledge Discovery from Data (TKDD)


Abstract

Information Discovery and Analysis Systems (IDAS) are designed to correlate multiple sources of data and use data mining techniques to identify potentially significant events. Application domains for IDAS are numerous and include the emerging area of homeland security.

Developing test cases for an IDAS requires background data sets onto which hypothetical future scenarios can be overlaid. The accuracy of the IDAS can then be measured in terms of false positive and false negative error rates. Obtaining the test data sets can be an obstacle because of privacy issues and the time and cost of collecting a diverse set of data sources.

In this paper, we give an overview of the design and architecture of an IDAS Data Set Generator (IDSG) that enables fast and comprehensive testing of an IDAS. The IDSG generates data using statistical and rule-based algorithms, together with semantic graphs that represent interdependencies between attributes. A credit card transaction application is used to illustrate the approach.
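
The evaluation loop described in the abstract (generate background data, overlay a hypothetical scenario, then score a detector by its false positive and false negative rates) can be illustrated with a minimal Python sketch. This is not the IDSG itself: the lognormal amount model, the "unusually large purchase" scenario, the threshold detector, and all names (Transaction, generate_background, overlay_scenario, error_rates) are assumptions made purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Transaction:
    card_id: int
    amount: float
    is_scenario: bool  # ground-truth label: True for overlaid scenario records


def generate_background(n, rng):
    """Draw n background transactions from an assumed statistical model
    (lognormal purchase amounts, uniformly chosen card ids)."""
    return [
        Transaction(rng.randrange(1000), round(rng.lognormvariate(3.5, 1.0), 2), False)
        for _ in range(n)
    ]


def overlay_scenario(background, n_events, rng):
    """Overlay a hypothetical scenario (here: unusually large purchases)
    onto the background data and shuffle the combined records."""
    scenario = [
        Transaction(rng.randrange(1000), round(rng.uniform(2000.0, 5000.0), 2), True)
        for _ in range(n_events)
    ]
    data = background + scenario
    rng.shuffle(data)
    return data


def error_rates(data, detector):
    """Return (false positive rate, false negative rate) of a detector,
    using the known labels carried by the synthetic records."""
    negatives = sum(1 for t in data if not t.is_scenario)
    positives = len(data) - negatives
    fp = sum(1 for t in data if detector(t) and not t.is_scenario)
    fn = sum(1 for t in data if not detector(t) and t.is_scenario)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    data = overlay_scenario(generate_background(10_000, rng), 50, rng)
    # Stand-in for the system under test: flag any purchase above 1500.
    fpr, fnr = error_rates(data, lambda t: t.amount > 1500.0)
    print(f"false positive rate = {fpr:.4f}, false negative rate = {fnr:.4f}")
```

Because the generator knows which records belong to the overlaid scenario, both error rates can be computed exactly, which is the practical advantage of synthetic test data over real data with unknown ground truth.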