Generation of synthetic data sets for evaluating the accuracy of knowledge discovery systems

  • Authors:
  • Daniel R. Jeske;Behrokh Samadi;Pengyue J. Lin;Lan Ye;Sean Cox;Rui Xiao;Ted Younglove;Minh Ly;Douglas Holt;Ryan Rich

  • Affiliations:
  • University of California, Riverside, CA;Lucent Technologies, Holmdel, NJ;University of California, Riverside, CA;University of California, Riverside, CA;University of California, Riverside, CA;University of California, Riverside, CA;University of California, Riverside, CA;University of California, Riverside, CA;University of California, Riverside, CA;University of California, Riverside, CA

  • Venue:
  • Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

Information Discovery and Analysis Systems (IDAS) are designed to correlate multiple sources of data and use data mining techniques to identify potential significant events. Application domains for IDAS are numerous and include the emerging area of homeland security.Developing test cases for an IDAS requires background data sets into which hypothetical future scenarios can be overlaid. The IDAS can then be measured in terms of false positive and false negative error rates. Obtaining the test data sets can be an obstacle due to both privacy issues and also the time and cost associated with collecting a diverse set of data sources.In this paper, we give an overview of the design and architecture of an IDAS Data Set Generator (IDSG) that enables a fast and comprehensive test of an IDAS. The IDSG generates data using statistical and rule-based algorithms and also semantic graphs that represent interdependencies between attributes. A credit card transaction application is used to illustrate the approach.