Distributed Nearest Neighbor-Based Condensation of Very Large Data Sets

Authors:
Fabrizio Angiulli;Gianluigi Folino
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2007

Abstract

This work presents PFCNN, a distributed method for computing a consistent subset of a very large data set for the nearest neighbor classification rule. To cope with the communication overhead typical of distributed environments and to reduce memory requirements, different variants of the basic PFCNN method are introduced. The spatial cost, CPU cost, and communication overhead of all the algorithms are analyzed. Experiments on both synthetic and real very large data sets show that these methods can be profitably applied to enormous collections of data: they scale up well and are efficient in memory consumption, confirming the theoretical analysis, and they achieve noticeable data reduction and good classification accuracy. To the best of our knowledge, this is the first distributed algorithm for computing a training-set-consistent subset for the nearest neighbor rule.
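
To make the notion of a training-set-consistent subset concrete, the following is a minimal single-machine sketch. It is not PFCNN and not any of its distributed variants; it uses the classic condensed nearest neighbor idea of repeatedly adding misclassified points. A subset S of a training set T is consistent when the nearest neighbor rule run over S alone classifies every point of T correctly. The function names (`nearest_label`, `is_consistent`, `condense`), the Euclidean distance, and the toy data are assumptions made for illustration only.

```python
import math


def nearest_label(point, subset):
    """Return the label of the element of `subset` closest to `point`."""
    closest = min(subset, key=lambda item: math.dist(point, item[0]))
    return closest[1]


def is_consistent(subset, training_set):
    """True if the NN rule over `subset` labels every training point correctly."""
    return all(nearest_label(x, subset) == y for x, y in training_set)


def condense(training_set):
    """Grow a subset by adding misclassified points until a pass adds nothing.

    Assumes no two identical points carry different labels; the membership
    check below only prevents an endless loop if that assumption is violated.
    """
    subset = [training_set[0]]
    changed = True
    while changed:
        changed = False
        for x, y in training_set:
            if nearest_label(x, subset) != y and (x, y) not in subset:
                subset.append((x, y))
                changed = True
    return subset


# Hypothetical toy data: two well-separated classes in the plane.
training = [
    ((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((0.0, 0.1), "a"),
    ((1.0, 1.0), "b"), ((1.1, 1.0), "b"), ((1.0, 1.1), "b"),
]
subset = condense(training)
```

On this toy input the condensed subset keeps one point per class while remaining consistent with all six training points. Each pass of this sketch scans the whole training set on one machine, which gives a sense of why distributing the computation matters for very large data sets.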