Active disks: programming model, algorithms and evaluation
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Active Storage for Large-Scale Data Mining and Multimedia
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Evaluating MapReduce for Multi-core and Multiprocessor Systems
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Accelerating data-intensive science with Gordon and Dash
Proceedings of the 2010 TeraGrid Conference
Active flash: towards energy-efficient, in-situ data analytics on extreme-scale machines
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Hi-index | 0.00 |
Modern scientific discovery often involves running complex application simulations on supercomputers, followed by a sequence of data analysis tasks on smaller clusters. This offline approach suffers from significant data movement costs such as redundant I/O, storage bandwidth bottleneck, and wasted CPU cycles, all of which contribute to increased energy consumption and delayed end-to-end performance. Technology projections for an exascale machine indicate that energy-efficiency will become the primary design metric. It is estimated that the energy cost of data movement will soon rival the cost of computation. Consequently, we can no longer ignore the data movement costs in data analysis. To address these challenges, we advocate executing data analysis tasks on emerging storage devices, such as SSDs. Typically, in extreme-scale systems, SSDs serve only as a temporary storage system for the simulation output data. In our approach, Active Flash, we propose to conduct in-situ data analysis on the SSD controller without degrading the performance of the simulation job. By migrating analysis tasks closer to where the data resides, it helps reduce the data movement cost. We present detailed energy and performance models for both active flash and offline strategies, and study them using extreme-scale application simulations, commonly used data analytics kernels, and supercomputer system configurations. Our evaluation suggests that active flash is a promising approach to alleviate the storage bandwidth bottleneck, reduce the data movement cost, and improve the overall energy efficiency.