Algorithms for clustering data
Algorithms for clustering data
Compiling Fortran D for MIMD distributed-memory machines
Communications of the ACM
Static analysis to reduce synchronization costs in data-parallel programs
POPL '96 Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Bayesian classification (AutoClass): theory and results
Advances in knowledge discovery and data mining
Using integer sets for data-parallel program analysis and optimization
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
ICS '98 Proceedings of the 12th international conference on Supercomputing
Adaptive reduction parallelization techniques
Proceedings of the 14th international conference on Supercomputing
SQLEM: fast clustering in SQL using the EM algorithm
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Data mining: concepts and techniques
Data mining: concepts and techniques
Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey
Data Mining and Knowledge Discovery
Parallel Programming with Polaris
Computer
Parallel Mining of Association Rules
IEEE Transactions on Knowledge and Data Engineering
Compiling Global Name-Space Parallel Loops for Distributed Execution
IEEE Transactions on Parallel and Distributed Systems
Photon mapping on programmable graphics hardware
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Compiler Optimization of Implicit Reductions for Distributed Memory Multiprocessors
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
GPU Cluster for High Performance Computing
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Exploring Graphics Processor Performance for General Purpose Applications
DSD '05 Proceedings of the 8th Euromicro Conference on Digital System Design
GPUTeraSort: high performance graphics co-processor sorting for large database management
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Accelerator: using data parallelism to program GPUs for general-purpose uses
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
A performance-oriented data parallel virtual machine for GPUs
ACM SIGGRAPH 2006 Sketches
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
A compiler framework for optimization of affine loop nests for gpgpus
Proceedings of the 22nd annual international conference on Supercomputing
Mars: a MapReduce framework on graphics processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallelizing user-defined and implicit reductions globally on multiprocessors
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Initial experiences porting a bioinformatics application to a graphics processor
PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
A graphics hardware accelerated algorithm for nearest neighbor search
ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part IV
Accelerating SQL database operations on a GPU with CUDA
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Proceedings of the 24th ACM International Conference on Supercomputing
An integer programming framework for optimizing shared memory use on GPUs
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
GRace: a low-overhead mechanism for detecting data races in GPU programs
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Practical loop transformations for tensor contraction expressions on multi-level memory hierarchies
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Proceedings of the international conference on Supercomputing
A GPU-based high-throughput image retrieval algorithm
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Dense affinity propagation on clusters of GPUs
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Optimizing tensor contraction expressions for hybrid CPU-GPU execution
Cluster Computing
Hi-index | 0.00 |
Modern GPUs offer much computing power at a very modest cost. Even though CUDA and other related recent developments are accelerating the use of GPUs for general purpose applications, several challenges still remain in programming the GPUs. Thus, it is clearly desirable to be able to program GPUs using a higher-level interface. In this paper, we offer a solution that targets a specific class of applications, which are the data mining and scientific data analysis applications. Our work is driven by the observation that a common processing structure, that of generalized reductions, fits a large number of popular data mining algorithms. In our solution, the programmers simply need to specify the sequential reduction loop(s) with some additional information about the parameters. We use program analysis and code generation to map the applications to a GPU. Several additional optimizations are also performed by the system. We have evaluated our system using three popular data mining applications, k-means clustering, EM clustering, and Principal Component Analysis (PCA). The main observations from our experiments are as follows. The speedup that each of these applications achieve over a sequential CPU version ranges between 20 and 50. The automatically generated version did not have any noticeable overheads compared to hand written codes. Finally, the optimizations performed in the system resulted in significant performance improvements.