Data Selection Using SASH Trees for Support Vector Machines

Authors:
Chaofan Sun;Ricardo Vilalta
Affiliations:
Department of Computer Science, University of Houston, 4800 Calhoun Rd., Houston, TX 77204-3010, Email: vilalta@cs.uh.edu;Center for Research and Advanced Studies (CINVESTAV), Av. Científica 1145, Guadalajara, 45010, México
Venue:
MLDM '07 Proceedings of the 5th international conference on Machine Learning and Data Mining in Pattern Recognition
Year:
2007

Abstract

This paper presents a data preprocessing procedure for selecting support vector (SV) candidates. We select decision boundary region vectors (BRVs) as SV candidates. BRVs can be identified without knowing the decision boundary, based on each vector's nearest neighbor of opposite class (NNO). To speed up the process, two spatial approximation sample hierarchical (SASH) trees are used to estimate the BRVs. Empirical results show that our data selection procedure can reduce a full dataset to a subset whose size equals, or only slightly exceeds, the number of SVs. Training on the selected subset gives performance comparable to training on the full dataset. For large datasets, the combined time spent selecting the subset and training on it is significantly lower than the time needed to train on the full dataset.
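
The NNO idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it replaces the approximate SASH search with an exact brute-force search, and it assumes one plausible selection rule, namely keeping every vector that is the NNO of at least one other vector. The function names `nearest_opposite_class` and `select_brv_candidates` are hypothetical, and the data must contain at least two classes.

```python
import numpy as np


def nearest_opposite_class(X, y):
    """For each row of X, return the index of its nearest neighbor
    (Euclidean distance) that carries a different label.

    Brute-force O(n^2) search; the paper uses SASH trees to approximate
    this step, which is not reproduced here.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    # Exclude same-class pairs (this also excludes each point itself).
    d2[y[:, None] == y[None, :]] = np.inf
    return d2.argmin(axis=1)


def select_brv_candidates(X, y):
    """Assumed selection rule (not taken from the paper): keep every
    vector that is the NNO of at least one vector in the dataset."""
    return np.unique(nearest_opposite_class(X, y))


if __name__ == "__main__":
    # Two well-separated 1-D classes: only the two points facing the gap
    # are ever chosen as someone's NNO.
    X = np.array([[0.0], [1.0], [2.0], [3.0], [5.0], [6.0], [7.0], [8.0]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    print(select_brv_candidates(X, y))  # prints [3 4]
```

The selected indices could then be used to train an SVM on `X[idx], y[idx]` instead of the full dataset. The exact search shown here costs quadratic time and memory, so it is only suitable for small datasets.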