Efficient dataset size reduction by finding homogeneous clusters

Authors:
Stefanos Ougiaroglou;Georgios Evangelidis
Affiliations:
University of Macedonia, Thessaloniki, Greece;University of Macedonia, Thessaloniki, Greece
Venue:
Proceedings of the Fifth Balkan Conference in Informatics
Year:
2012

Abstract

Although the k-Nearest Neighbor classifier is one of the most widely used classification methods, it suffers from high computational cost and storage requirements. Addressing these drawbacks has been an active research field over the last decades. This paper proposes an effective data reduction algorithm that has low preprocessing cost and reduces storage requirements while keeping classification accuracy at an acceptably high level. The proposed algorithm is based on a fast preprocessing clustering procedure that creates homogeneous clusters, that is, clusters whose items all belong to the same class. The centroids of these clusters constitute the reduced training set. Experimental results on real-life datasets show that the proposed algorithm is faster and achieves higher reduction rates than three known existing methods, without significantly reducing classification accuracy.
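
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a non-homogeneous cluster is split by running k-means seeded with the mean of each class present in it, and that this is repeated until every cluster holds a single class. The function names (`reduce_by_homogeneous_clusters`, `classify_1nn`) and the majority-label fallback for clusters that cannot be split are hypothetical choices made for illustration.

```python
from collections import Counter, deque


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(cluster, seeds, max_iter=20):
    """Lloyd iterations from the given seeds.

    `cluster` is a list of (point, label) pairs. Returns the non-empty
    resulting clusters, each again a list of (point, label) pairs.
    """
    centers = list(seeds)
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p, _ in cluster
        ]
        if new_assign == assign:  # assignments are stable: converged
            break
        assign = new_assign
        for j in range(len(centers)):
            members = [p for (p, _), a in zip(cluster, assign) if a == j]
            if members:
                centers[j] = mean(members)
    parts = [[] for _ in centers]
    for item, a in zip(cluster, assign):
        parts[a].append(item)
    return [part for part in parts if part]


def reduce_by_homogeneous_clusters(points, labels):
    """Return one (centroid, label) pair per homogeneous cluster."""
    if not points:
        return []
    reduced = []
    queue = deque([list(zip(points, labels))])
    while queue:
        cluster = queue.popleft()
        pts = [p for p, _ in cluster]
        classes = Counter(y for _, y in cluster)
        if len(classes) == 1:
            # Homogeneous: its centroid becomes one reduced training item.
            reduced.append((mean(pts), cluster[0][1]))
            continue
        # Assumption: seed k-means with one mean per class in the cluster.
        seeds = [mean([p for p, y in cluster if y == c]) for c in classes]
        parts = kmeans(cluster, seeds)
        if len(parts) == 1:
            # Assumption: an unsplittable cluster keeps its majority label.
            reduced.append((mean(pts), classes.most_common(1)[0][0]))
        else:
            queue.extend(parts)
    return reduced


def classify_1nn(reduced, query):
    """Label of the reduced item nearest to `query`."""
    return min(reduced, key=lambda item: sq_dist(item[0], query))[1]
```

For example, six points forming two well-separated classes reduce to two centroids, which a 1-NN classifier can then search instead of the full training set:

```python
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["a", "a", "a", "b", "b", "b"]
reduced = reduce_by_homogeneous_clusters(points, labels)
print(len(reduced))                        # 2
print(classify_1nn(reduced, (0.5, 0.5)))   # a
```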