Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing

Authors:
Hwanjo Yu; Jiong Yang; Jiawei Han; Xiaolei Li
Affiliations:
Department of Computer Science, University of Iowa, Iowa, USA; Department of Computer Science, Case Western Reserve University, Ohio, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA
Venue:
Data Mining and Knowledge Discovery
Year:
2005

Abstract

Support vector machines (SVMs) are promising methods for classification and regression analysis because of their solid mathematical foundations, which provide two desirable properties: margin maximization and nonlinear classification using kernels. Despite these properties, SVMs are usually not chosen for large-scale data mining problems because their training complexity depends heavily on the data set size. Unlike traditional pattern recognition and machine learning tasks, real-world data mining applications often involve huge numbers of data records, so it is too expensive to scan the entire data set multiple times and infeasible to hold it in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes SVM performance for very large data sets given limited resources, e.g., memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high-quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much smaller than the size of the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large data sets and very accurate in terms of classification.
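
The single-scan summarization idea can be illustrated with a small sketch. It is not the authors' CB-SVM implementation: the abstract does not spell out the data structure, so the code assumes a BIRCH-style summary (count, linear sum, sum of squares) per micro-cluster and a flat, non-hierarchical list of clusters. The names `MicroCluster`, `summarize`, `training_samples`, and the `threshold` parameter are hypothetical. Each class is summarized in one pass, and the resulting centroids, with their member counts as weights, are what an SVM trainer would receive in place of the raw records.

```python
from math import sqrt


class MicroCluster:
    """Running summary of absorbed points (hypothetical helper, BIRCH-style)."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.square_sum = sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # RMS distance of members from the centroid, derived from the summary alone.
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return sqrt(max(variance, 0.0))

    def add(self, point):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)


def summarize(points, threshold):
    """One pass over the data: a point joins the nearest cluster whose
    centroid lies within `threshold`, otherwise it starts a new cluster."""
    clusters = []
    for p in points:
        best, best_d = None, None
        for c in clusters:
            d = sqrt(sum((a - b) ** 2 for a, b in zip(c.centroid(), p)))
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            best.add(p)
        else:
            clusters.append(MicroCluster(p))
    return clusters


def training_samples(data_by_label, threshold):
    """Summarize each class separately; return (centroid, label, weight) triples."""
    samples = []
    for label, points in data_by_label.items():
        for c in summarize(points, threshold):
            samples.append((c.centroid(), label, c.n))
    return samples


# Toy data (made up for illustration): six records reduce to three samples.
data = {
    +1: [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)],
    -1: [(10.0, 10.0), (10.1, 10.0)],
}
samples = training_samples(data, threshold=1.0)
for centroid, label, weight in samples:
    print(label, weight, centroid)
```

The linear scan over existing clusters keeps the sketch short; a hierarchical index over the clusters, as the paper's title suggests, would replace that inner loop, and the weighted centroids could then be passed to any SVM trainer that accepts per-sample weights.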