Performance and Memory-Access Characterization of Data Mining Applications

  • Authors:
  • Jeffrey P. Bradford; José Fortes

  • Venue:
  • WWC '98 Proceedings of the Workload Characterization: Methodology and Case Studies
  • Year:
  • 1998

Abstract

This paper characterizes the performance and memory-access behavior of a decision tree induction program, a previously unstudied application used in data mining and knowledge discovery in databases. Performance is studied via RSIM, an execution-driven simulator, for three uniprocessor models that exploit instruction-level parallelism to varying degrees. Several properties of the program are noted. Out-of-order dispatch and multiple issue provide a significant performance advantage: a 50% to 250% improvement in IPC for out-of-order dispatch versus in-order dispatch, and a 5% to 120% improvement in IPC for four-way issue versus single issue. Multiple issue provides a greater performance improvement for larger L2 cache sizes, when the program is limited by CPU performance; out-of-order dispatch provides a greater performance improvement for smaller L2 cache sizes. The program has a very small instruction footprint: for an 8-kB L1 instruction cache, the instruction miss rate is below 0.1%. A small (8-kB) L1 data cache is sufficient to capture most of the locality of the data references, resulting in L1 miss rates between 10% and 20%. Increasing the size of the L2 data cache does not significantly improve performance until a substantial fraction (over 1/4) of the dataset fits into the L2 cache. Lastly, a procedure is developed for scaling the cache sizes when using scaled-down datasets, allowing the results for smaller datasets to be used to predict the performance of full-sized datasets.
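
To make the abstract's quantities concrete, the following is a minimal sketch of the arithmetic involved. It is not the paper's procedure: the abstract does not state how cache sizes are scaled, so the proportional rule below (keeping the cache-to-dataset size ratio constant) is purely an illustrative assumption, and all function names are hypothetical.

```python
def ipc_improvement_pct(baseline_ipc: float, improved_ipc: float) -> float:
    """Relative IPC gain in percent, e.g. out-of-order vs. in-order dispatch."""
    if baseline_ipc <= 0:
        raise ValueError("baseline IPC must be positive")
    return (improved_ipc - baseline_ipc) / baseline_ipc * 100.0


def scaled_cache_bytes(full_cache_bytes: int,
                       full_dataset_bytes: int,
                       scaled_dataset_bytes: int) -> int:
    """ASSUMPTION (not from the paper): shrink the cache in proportion to the
    dataset, so the cache-to-dataset size ratio stays constant."""
    if full_dataset_bytes <= 0:
        raise ValueError("full dataset size must be positive")
    return full_cache_bytes * scaled_dataset_bytes // full_dataset_bytes


def l2_holds_over_quarter(l2_bytes: int, dataset_bytes: int) -> bool:
    """True when more than 1/4 of the dataset fits in L2, the point past which
    the abstract reports that a larger L2 starts to pay off."""
    return 4 * l2_bytes > dataset_bytes


if __name__ == "__main__":
    KB, MB = 1024, 1024 * 1024
    # Hypothetical numbers, chosen only to exercise the helpers.
    print(ipc_improvement_pct(1.0, 2.5))
    print(scaled_cache_bytes(1 * MB, 64 * MB, 4 * MB) // KB, "kB")
    print(l2_holds_over_quarter(2 * MB, 4 * MB))
```

Under the proportional assumption, a 1-MB L2 cache paired with a 64-MB dataset would correspond to a 64-kB L2 cache for a 4-MB scaled-down dataset; the paper's actual scaling procedure may differ.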