A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees

Authors:
Doina Caragea; Adrian Silvescu; Vasant Honavar
Affiliations:
Artificial Intelligence Research Laboratory, Computer Science Department, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1040, USA. {dcaragea, silvescu, honavar}@cs.iastate.edu
Venue:
International Journal of Hybrid Intelligent Systems
Year:
2004


Abstract

This paper motivates and precisely formulates the problem of learning from distributed data; describes a general strategy for transforming traditional machine learning algorithms into algorithms for learning from distributed data; demonstrates the application of this strategy to devise algorithms for decision tree induction from distributed data; and identifies the conditions under which the algorithms in the distributed setting are superior to their centralized counterparts in terms of time and communication complexity. The resulting algorithms are provably exact in that the decision tree constructed from distributed data is identical to that obtained in the centralized setting. Some natural extensions leading to algorithms for learning from heterogeneous distributed data and learning under privacy constraints are outlined.
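
The exactness claim can be illustrated with a minimal sketch. It is not the paper's algorithm, only an example built on assumptions: the data are horizontally partitioned (each site holds complete rows), the split criterion is information gain, and the sufficient statistics for choosing a split are counts of (attribute value, class) pairs. All function names and the row layout (dictionaries with a `label` key) are hypothetical.

```python
from collections import Counter
from math import log2


def local_counts(rows, attribute, label="label"):
    """Sufficient statistics computed at one site: (attribute value, class) counts."""
    return Counter((row[attribute], row[label]) for row in rows)


def entropy(class_counts):
    """Shannon entropy (in bits) of a class-count table."""
    total = sum(class_counts.values())
    return -sum((n / total) * log2(n / total) for n in class_counts.values() if n)


def information_gain(counts):
    """Information gain of an attribute, computed only from (value, class) counts."""
    by_class = Counter()
    by_value = {}
    for (value, cls), n in counts.items():
        by_class[cls] += n
        by_value.setdefault(value, Counter())[cls] += n
    total = sum(by_class.values())
    remainder = sum(
        sum(cc.values()) / total * entropy(cc) for cc in by_value.values()
    )
    return entropy(by_class) - remainder


def distributed_gain(sites, attribute):
    """Add up per-site counts, then score the attribute; raw rows never leave a site."""
    merged = Counter()
    for rows in sites:
        merged.update(local_counts(rows, attribute))
    return information_gain(merged)
```

Because the counts are additive across sites, the merged table equals the table that would be computed on the pooled data, so the chosen split is the same in both settings. Only count tables are communicated, never the rows themselves.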