A characterization of data mining algorithms on a modern processor
DaMoN '05 Proceedings of the 1st international workshop on Data management on new hardware
This paper characterizes the performance and memory-access behavior of a decision tree induction program, a previously unstudied application used in data mining and knowledge discovery in databases. Performance is studied via RSIM, an execution-driven simulator, for three uniprocessor models that exploit instruction-level parallelism to varying degrees. Several properties of the program are noted. Out-of-order dispatch and multiple issue provide a significant performance advantage: a 50%–250% improvement in IPC for out-of-order versus in-order dispatch, and a 5%–120% improvement in IPC for four-way versus single issue. Multiple issue yields a greater performance improvement for larger L2 cache sizes, when the program is limited by CPU performance; out-of-order dispatch yields a greater improvement for smaller L2 cache sizes. The program has a very small instruction footprint: for an 8 kB L1 instruction cache, the instruction miss rate is below 0.1%. A small (8 kB) L1 data cache is sufficient to capture most of the locality of the data references, resulting in L1 miss rates between 10% and 20%. Increasing the size of the L2 data cache does not significantly improve performance until a significant fraction (over one quarter) of the dataset fits into the L2 cache. Lastly, a procedure is developed for scaling the cache sizes when using scaled-down datasets, allowing results for smaller datasets to predict the performance of full-sized datasets.
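The abstract does not spell out the cache-scaling procedure, but a plausible minimal sketch, assuming the idea is to keep the cache-to-dataset ratio of the scaled-down run equal to that of the full-sized configuration, looks like this (the function name and parameters are hypothetical, not taken from the paper):

```python
def scaled_cache_size(full_cache_kb: float,
                      full_dataset_kb: float,
                      small_dataset_kb: float) -> float:
    """Return the simulated L2 cache size (kB) for a scaled-down
    dataset, preserving the full run's cache-to-dataset ratio.

    This is an illustrative assumption about the procedure, not the
    paper's exact method.
    """
    ratio = full_cache_kb / full_dataset_kb  # fraction of dataset that fits in cache
    return ratio * small_dataset_kb

# Example: a 1 MB L2 cache paired with a 16 MB dataset corresponds
# to a 64 kB L2 cache when simulating a 1 MB scaled-down dataset.
print(scaled_cache_size(1024, 16 * 1024, 1024))  # → 64.0
```

Under this assumption, simulator results for the small dataset at the scaled cache size would predict behavior of the full dataset at the full cache size, since both runs see the same fraction of their working set resident in the L2 cache.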