Modelling and Simulation in Engineering
Document clustering is a central method for mining massive amounts of data. Because of the explosion of raw documents generated on the Internet and the need to analyze them efficiently in various intelligent information systems, clustering techniques have reached their limits on single processors. General-purpose multi-core chips are increasingly deployed in response to the diminishing returns in single-processor speedup caused by the frequency wall, but multi-core chips provide only linear speedups while the number of documents on the Internet is growing exponentially. Hardware accelerators are a promising way to improve performance on data-intensive problems such as document clustering. They offer more radical designs with a higher level of parallelism, but require adaptation to novel programming environments. In this paper, we assess the benefits of exploiting the computational power of graphics processing units (GPUs) for two fundamental problems in document mining: calculating the term frequency-inverse document frequency (TF-IDF) and clustering a large set of documents. We transform traditional algorithms into accelerated parallel counterparts that can be executed efficiently on many-core GPU architectures. We assess our implementations on various platforms, ranging from stand-alone GPU desktops to Beowulf-like clusters equipped with contemporary GPU cards. We observe speedups of at least one order of magnitude over CPU-only desktops and clusters. This demonstrates the potential of GPU clusters to solve massive document mining problems efficiently. To the best of our knowledge, such speedups, combined with the scalability potential and accelerator-based parallelization, are unique in the domain of document-based data mining.
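
For readers unfamiliar with TF-IDF, the following is a minimal sequential sketch in Python of one common textbook variant (term frequency normalized by document length, multiplied by the natural logarithm of the inverse document frequency). It is an illustration only: the function name `tf_idf`, the tokenized input format, and the exact weighting formula are assumptions made here, and it is not the GPU implementation evaluated in the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant (not taken from the paper):
      tf(t, d) = count of t in d / number of tokens in d
      idf(t)   = ln(N / df(t)), where N is the number of documents
                 and df(t) is the number of documents containing t
    `docs` is a list of token lists (hypothetical input format).
    """
    n_docs = len(docs)

    # Document frequency: count each term at most once per document.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    corpus = [["gpu", "cluster", "gpu"], ["cpu", "cluster"]]
    for doc_weights in tf_idf(corpus):
        print(doc_weights)
```

Once the document frequencies are known, each document's weights depend only on that document's own term counts, so the per-document loop has no cross-document dependencies. That independence is the kind of structure that lends itself to parallel execution, although how the paper maps the computation onto GPU threads is not shown here.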