Dynamic runtimes can simplify parallel programming by automatically managing concurrency and locality without further burdening the programmer. Nevertheless, implementing such runtime systems for large-scale, shared-memory systems can be challenging. This work optimizes Phoenix, a MapReduce runtime for shared-memory multi-cores and multiprocessors, on a quad-chip, 32-core, 256-thread UltraSPARC T2+ system with NUMA characteristics. We show how a multi-layered approach that comprises optimizations to the algorithm, the implementation, and the OS interaction leads to significant speedup improvements with 256 threads (average of 2.5× higher speedup, maximum of 19×). We also identify the roadblocks that limit the scalability of parallel runtimes on shared-memory systems, which are inherently tied to the OS scalability on large-scale systems.
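To make the programming model concrete, the following is a minimal sketch of the map → local-combine → merge pattern that a shared-memory MapReduce runtime like Phoenix manages automatically. This is not Phoenix's actual C API; it is an illustrative word-count in standard C++ threads, where each worker combines counts locally (reducing shared-state contention, in the spirit of the tiling and merge optimizations cited above) before a final sequential merge.

```cpp
// Hedged sketch: shared-memory MapReduce-style word count.
// Illustrative only -- not the Phoenix API.
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

std::map<std::string, int> word_count(const std::vector<std::string>& chunks) {
    // One private partial result per worker: the "local combine" step.
    std::vector<std::map<std::string, int>> partials(chunks.size());
    std::vector<std::thread> workers;
    for (size_t i = 0; i < chunks.size(); ++i) {
        workers.emplace_back([&partials, &chunks, i] {
            // Map phase: tokenize this chunk and count into worker-local
            // storage, so no synchronization is needed during the map.
            std::istringstream in(chunks[i]);
            std::string w;
            while (in >> w) ++partials[i][w];
        });
    }
    for (auto& t : workers) t.join();

    // Merge/reduce phase: fold the per-worker partials into one result.
    std::map<std::string, int> result;
    for (const auto& p : partials)
        for (const auto& [w, c] : p) result[w] += c;
    return result;
}
```

A real runtime would additionally handle chunking of the input, NUMA-aware placement of the partial buffers, and parallel (e.g. hierarchical) merging, which are exactly the layers the paper tunes.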