Creating realistic, scenario-based synthetic data for test and evaluation of information analytics software

  • Authors:
  • Mark A. Whiting; Jereme Haack; Carrie Varley

  • Affiliations:
  • Pacific Northwest National Laboratory, Richland, Washington (all authors)

  • Venue:
  • Proceedings of the 2008 Workshop on BEyond time and errors: novel evaLuation methods for Information Visualization (BELIV '08)
  • Year:
  • 2008

Abstract

We describe the Threat Stream Generator, a method and toolset for creating realistic, synthetic test data for information analytics applications. Finding or creating useful test data sets is difficult for a team focused on building solutions to information analysis problems. First, real data that might be suitable for testing analytic applications may be unavailable or classified; in the latter case, tool builders will not have the clearances needed to use, or even see, the data. Second, analysts' time is scarce, and eliciting from them the characteristics of real data needed to create a test data set is difficult. Finally, generating good test data is challenging in itself: commercial data generators target large-database testing, not information analytics tool testing. Our distinctive contribution is that we embed known ground truth in each test data set, so that tool developers and others can measure the effectiveness of their software and track their progress in supporting information analysts. Our automated methods also significantly decrease data set development time. We review our approach to scenario development, threat insertion strategies, data set development, and data set evaluation. We also discuss our recent successes in using our data in open analytic competitions.
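To make the central idea concrete, the sketch below illustrates, in a hedged way, what embedding known ground truth in a synthetic data set can look like: benign background records are generated, a small scripted threat thread is inserted among them, and every inserted record is logged to a separate ground-truth manifest so that tool output can later be scored against it. All names here (make_background_message, THREAT_SCENARIO, the file names, the toy scenario text) are hypothetical illustrations and are not drawn from the actual Threat Stream Generator.

```python
import csv
import json
import random
from datetime import datetime, timedelta

random.seed(42)  # a fixed seed keeps the generated data set reproducible

BACKGROUND_TOPICS = ["city council", "sports scores", "weather", "traffic"]

# A toy scripted scenario: a few related messages that together form the threat thread.
THREAT_SCENARIO = [
    "Suspect A rents a storage unit near the harbor",
    "Suspect B wires funds to Suspect A",
    "Suspects A and B meet at the harbor warehouse",
]

def make_background_message(i, ts):
    """One benign background message; it carries no ground-truth label."""
    return {"id": f"bg-{i}",
            "time": ts.isoformat(),
            "text": f"Routine report about {random.choice(BACKGROUND_TOPICS)}"}

def build_data_set(n_background=200):
    """Generate background traffic, then insert the scripted threat thread."""
    start = datetime(2008, 4, 1)
    records = [make_background_message(i, start + timedelta(minutes=15 * i))
               for i in range(n_background)]
    ground_truth = []
    for step, text in enumerate(THREAT_SCENARIO):
        idx = random.randrange(len(records))
        rec = {"id": f"threat-{step}",
               "time": records[idx]["time"],  # blend into the surrounding timeline
               "text": text}
        records.insert(idx, rec)
        # Stable ids (rather than positions) identify the embedded ground truth.
        ground_truth.append({"id": rec["id"], "step": step, "text": text})
    return records, ground_truth

if __name__ == "__main__":
    records, truth = build_data_set()
    # The data set handed to tool developers contains no labels...
    with open("messages.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "time", "text"])
        writer.writeheader()
        writer.writerows(records)
    # ...while the ground-truth manifest is kept aside for scoring tool output.
    with open("ground_truth.json", "w") as f:
        json.dump(truth, f, indent=2)
```

Under this assumed setup, an evaluator would compare the record ids a tool flags against ground_truth.json, which is the kind of effectiveness and progress measurement the abstract describes, though the real system's scenarios, formats, and scoring are far richer than this toy example.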