Producing a nonlinear SVM classifier on very large-scale data is extremely challenging. In this paper we describe a novel P-packSVM algorithm that solves the Support Vector Machine (SVM) optimization problem with an arbitrary kernel. The algorithm employs the best known stochastic gradient descent method to optimize the primal objective, and its complexity has a 1/ϵ dependency to obtain a solution of optimization error ϵ. The algorithm can be highly parallelized with a special packing strategy, and experiences sub-linear speed-up with hundreds of processors. We demonstrate that P-packSVM achieves accuracy sufficiently close to that of SVM-light, and outperforms the state-of-the-art parallel SVM trainer PSVM in both accuracy and efficiency. As an illustration, our algorithm trains on the CCAT dataset of 800k samples in 13 minutes with 95% accuracy, while PSVM needs 5 hours and reaches only 92% accuracy. Finally, we demonstrate the capability of P-packSVM on 8 million training samples.
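
To make the core idea concrete, the following is a minimal, serial sketch of stochastic gradient descent on the primal SVM objective with an arbitrary kernel, in the Pegasos style. It is not the authors' P-packSVM implementation (which adds the packing strategy and parallelization); the RBF kernel, hyperparameters, and function names are illustrative assumptions.

```python
# Sketch only: kernelized Pegasos-style SGD on the primal hinge-loss objective.
# Not the authors' P-packSVM code; kernel and hyperparameters are assumptions.
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian (RBF) kernel between two feature vectors.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_sgd_train(X, y, lam=0.01, epochs=5, kernel=rbf_kernel, seed=0):
    """Train a kernel SVM by SGD on the primal objective.

    X: (n, d) feature matrix, y: labels in {-1, +1}.
    Returns (alpha, t): alpha[j] counts how often example j violated the
    margin, t is the total number of SGD steps taken.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    alpha = np.zeros(n)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            # Decision value of the current iterate on example i,
            # expressed through kernel evaluations against past violators.
            support = np.flatnonzero(alpha)
            f = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in support)
            f /= (lam * t)
            if y[i] * f < 1.0:   # margin violated: take a gradient step
                alpha[i] += 1.0
    return alpha, t

def kernel_sgd_predict(alpha, t, X, y, x_new, lam=0.01, kernel=rbf_kernel):
    # Predict the label of a new point from the learned coefficients.
    support = np.flatnonzero(alpha)
    f = sum(alpha[j] * y[j] * kernel(X[j], x_new) for j in support)
    f /= (lam * t)
    return 1 if f >= 0 else -1
```

Each SGD step touches a single random example and only requires kernel evaluations against previously updated points, which is what makes a 1/ϵ-type convergence and a distributed, packed evaluation of these kernel sums plausible at large scale.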