We present a case study of parallelizing streaming aggregation on three parallel hardware architectures. Aggregation is a performance-critical operation for data summarization in stream computing and is common in sense-and-respond applications. Commodity parallel hardware shows promise as an accelerator for streaming aggregation, but how streaming aggregation maps to the different parallel architectures is still an open question. Streaming aggregation is clearly data parallel, but as we demonstrate, in practice its performance depends more on efficient data movement than on computation. Furthermore, we use workloads such as stock market data, which introduce unique data distribution problems. The three parallel architectures in our study are an Intel Core 2 Quad processor, an Nvidia GTX 285 GPU, and the IBM PowerXCell 8i, an enhanced version of the Cell Broadband Engine architecture. Our implementations use OpenMP, CUDA, and Cellgen (a compiler for OpenMP-like support on Cell), respectively. We find that the Cell's programmable local storage and its low-latency, high-bandwidth access to main memory are best suited for parallelizing streaming aggregation. Future GPUs could overcome their latency and bandwidth limitations by being fully integrated into the system's memory hierarchy. To attain good performance on existing parallel architectures, we find that developers must characterize their problem in terms of communication versus computation costs; memory access patterns, including whether their algorithms reuse data; and the granularity of data access.
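As a rough illustration of why streaming aggregation is data parallel, the sketch below groups stock ticks by symbol inside fixed-size windows and aggregates each window independently. It is not the implementation evaluated in the study, which uses OpenMP, CUDA, and Cellgen. The count-based tumbling windows, the (symbol, price) tuple layout, the chosen aggregates, and the function names split_windows, aggregate_window, and aggregate_stream are all assumptions made for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_windows(tuples, size):
    """Cut the stream into consecutive count-based (tumbling) windows."""
    return [tuples[i:i + size] for i in range(0, len(tuples), size)]


def aggregate_window(window):
    """Group one window by symbol and compute count, min, max, and mean price."""
    groups = defaultdict(list)
    for symbol, price in window:
        groups[symbol].append(price)
    return {
        symbol: {
            "count": len(prices),
            "min": min(prices),
            "max": max(prices),
            "mean": sum(prices) / len(prices),
        }
        for symbol, prices in groups.items()
    }


def aggregate_stream(tuples, window_size, workers=4):
    """Aggregate every window independently; results keep window order."""
    windows = split_windows(tuples, window_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(aggregate_window, windows))


# Hypothetical input: (symbol, price) ticks in arrival order.
ticks = [
    ("IBM", 100.0), ("IBM", 102.0), ("AAPL", 50.0), ("IBM", 101.0),
    ("AAPL", 52.0), ("AAPL", 54.0), ("IBM", 99.0), ("AAPL", 53.0),
]
results = aggregate_stream(ticks, window_size=4)
```

Each window does little arithmetic per tuple, so most of the work lies in splitting, grouping, and moving the data, which is consistent with the observation that data movement matters more than computation. The thread pool here only expresses the independence of the windows; in CPython it should not be expected to speed up this kind of CPU-bound work.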