Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process with many current machine learning algorithms because of their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem: they require a kernel matrix that takes O(N^2) time and space to compute and store, where N is the number of data points. The proposed algorithm substantially reduces the computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed to be independent of the kernel-based machine learning algorithm applied afterwards, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
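To make the O(N^2) cost concrete, the following Python sketch contrasts a dense Gaussian kernel matrix with a block-wise approximation that only computes kernel entries between points falling in the same bucket. The abstract does not describe how the proposed algorithm forms its approximation, so everything here is an illustrative assumption rather than the authors' method: the bucketing by signs of random projections, the Gaussian kernel, and the names `gaussian_kernel`, `bucket_ids`, `bucketed_kernel_blocks`, `stored_entries`, `n_bits`, and `sigma` are all hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Dense Gaussian kernel between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def bucket_ids(X, n_bits, rng):
    """Assumed bucketing: the sign pattern of n_bits random projections."""
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes) > 0
    # Pack the n_bits sign bits of each point into one integer bucket id.
    return bits.astype(np.int64) @ (1 << np.arange(n_bits))

def bucketed_kernel_blocks(X, n_bits=4, sigma=1.0, seed=0):
    """Compute kernel entries only between points that share a bucket."""
    rng = np.random.default_rng(seed)
    ids = bucket_ids(X, n_bits, rng)
    blocks = {}
    for b in np.unique(ids):
        idx = np.flatnonzero(ids == b)
        blocks[int(b)] = (idx, gaussian_kernel(X[idx], X[idx], sigma))
    return blocks

def stored_entries(blocks):
    """Number of kernel entries that are actually stored."""
    return sum(K.size for _, K in blocks.values())

if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((1000, 8))
    blocks = bucketed_kernel_blocks(X, n_bits=4)
    print("dense entries:   ", X.shape[0] ** 2)
    print("bucketed entries:", stored_entries(blocks))
```

In this sketch, `n_bits` plays the role of an approximation knob: more bits give smaller buckets and fewer stored entries, at the price of discarding more cross-bucket similarities. Because each block is an ordinary kernel matrix over a subset of the points, any kernel-based method could in principle be run on the blocks afterwards.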