Guided self-scheduling: A practical scheduling scheme for parallel supercomputers
IEEE Transactions on Computers
Algorithms for clustering data
Algorithms for clustering data
Instance-Based Learning Algorithms
Machine Learning
Factoring: a method for scheduling parallel loops
Communications of the ACM
Using processor affinity in loop scheduling on shared-memory multiprocessors
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Self-scheduling on distributed-memory machines
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers
IEEE Transactions on Parallel and Distributed Systems
GPUTeraSort: high performance graphics co-processor sorting for large database management
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Evaluating MapReduce for Multi-core and Multiprocessor Systems
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Mars: a MapReduce framework on graphics processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems
Proceedings of the 23rd international conference on Supercomputing
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Proceedings of the 24th ACM International Conference on Supercomputing
Run-time optimizations for replicated dataflows on heterogeneous environments
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
MapCG: writing parallel program portable between CPU and GPU
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters
CLOUDCOM '10 Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science
Using Shared Memory to Accelerate MapReduce on Graphics Processing Units
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Multi-GPU MapReduce on GPU Clusters
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
StreamMR: An Optimized MapReduce Framework for AMD GPUs
ICPADS '11 Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems
Coordinated energy management in heterogeneous processors
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Revisiting co-processing for hash joins on the coupled CPU-GPU architecture
Proceedings of the VLDB Endowment
Dynamic Partitioning-based JPEG Decompression on Heterogeneous Multicore Architectures
Proceedings of Programming Models and Applications on Multicores and Manycores
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
Proceedings of Workshop on General Purpose Processing Using GPUs
Hi-index | 0.00 |
The work presented here is driven by two observations. First, heterogeneous architectures that integrate a CPU and a GPU on the same chip are emerging, and hold much promise for supporting power-efficient and scalable high performance computing. Second, MapReduce has emerged as a suitable framework for simplified parallel application development for many classes of applications, including data mining and machine learning applications that benefit from accelerators. This paper focuses on the challenge of scaling a MapReduce application using the CPU and GPU together in an integrated architecture. We develop different methods for dividing the work, which are the map-dividing scheme, where map tasks are divided between both devices, and the pipelining scheme, which pipelines the map and the reduce stages on different devices. We develop dynamic work distribution schemes for both the approaches. To achieve high load balance while keeping scheduling costs low, we use a runtime tuning method to adjust task block sizes for the map-dividing scheme. Our implementation of MapReduce is based on a continuous reduction method, which avoids the memory overheads of storing key-value pairs. We have evaluated the different design decisions using 5 popular MapReduce applications. For 4 of the applications, our system achieves 1.21 to 2.1 speedup over the better of the CPU-only and GPU-only versions. The speedups over a single CPU core execution range from 3.25 to 28.68. The runtime tuning method we have developed achieves very low load imbalance, while keeping scheduling overheads low. Though our current work is specific to MapReduce, many underlying ideas are also applicable towards intra-node acceleration of other applications on integrated CPU-GPU nodes.