Performance Controlled Data Reduction for Knowledge Discovery in Distributed Databases

Authors:
Slobodan Vucetic;Zoran Obradovic
Venue:
PAKDD '00 Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications
Year:
2000

Abstract

The objective of data reduction is to obtain a compact representation of a large data set, both to facilitate repeated use of non-redundant information with complex and slow learning algorithms and to allow efficient data transfer and storage. For a user-controllable allowed accuracy loss, we propose an effective data reduction procedure based on guided sampling for identifying a minimal-size representative subset, followed by a model-sensitivity analysis for determining an appropriate compression level for each attribute. Experiments were performed on three large data sets and, depending on an allowed accuracy loss margin ranging from 1% to 5% of the ideal generalization, the achieved compression rates ranged between 95 and 12,500 times. These results indicate that transferring reduced data sets from multiple locations to a centralized site for efficient and accurate knowledge discovery might often be possible in practice.
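
The two stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not specify the sampling schedule, the learning model, or the sensitivity measure, so everything below is an assumption chosen for illustration. It assumes a one-attribute linear regression, a sample size that simply doubles each step, a uniform quantizer, and "allowed accuracy loss" read as a relative increase in holdout mean squared error. All function names (fit_line, progressive_sample, fewest_bits, and so on) are hypothetical.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx if sxx > 0 else 0.0
    return a, mean_y - a * mean_x

def mse(model, points):
    """Mean squared error of a fitted line on (x, y) pairs."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

def progressive_sample(train, holdout, allowed_loss, start=8):
    """Stage 1 (assumed form): double the sample until the holdout error
    is within allowed_loss of the error obtained from the full data."""
    reference = mse(fit_line(train), holdout)
    size = start
    while size < len(train):
        subset = train[:size]  # train is assumed to be in random order
        if mse(fit_line(subset), holdout) <= reference * (1 + allowed_loss):
            return subset
        size *= 2
    return train

def quantize(value, bits, lo, hi):
    """Round value to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    step = (hi - lo) / (2 ** bits - 1)
    return lo + round((value - lo) / step) * step

def fewest_bits(subset, holdout, allowed_loss, lo, hi, max_bits=16):
    """Stage 2 (assumed form): smallest bit width for attribute x whose
    quantized version still trains a model within the tolerance."""
    reference = mse(fit_line(subset), holdout)
    for bits in range(1, max_bits + 1):
        coarse = [(quantize(x, bits, lo, hi), y) for x, y in subset]
        if mse(fit_line(coarse), holdout) <= reference * (1 + allowed_loss):
            return bits
    return max_bits

# Synthetic data, purely illustrative: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5))
        for x in (rng.uniform(0, 10) for _ in range(5000))]
train, holdout = data[:4000], data[4000:]

subset = progressive_sample(train, holdout, allowed_loss=0.05)
bits = fewest_bits(subset, holdout, allowed_loss=0.05, lo=0.0, hi=10.0)
print("rows kept:", len(subset), "of", len(train), "| bits for x:", bits)
```

In this toy setting the reduction comes from two multiplicative factors, fewer rows and fewer bits per attribute value, which is the same decomposition the abstract describes. The compression rates reported in the abstract come from the authors' own experiments and are not reproduced by this sketch.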