Hierarchical Decision Tree Induction in Distributed Genomic Databases

Authors:
Amir Bar-Or;Daniel Keren;Assaf Schuster;Ran Wolff
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms, such as federated and peer-to-peer databases, are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of genomic databases. Our work is motivated by the existence of distributed databases in healthcare and bioinformatics, by the emergence of systems that automatically analyze these databases, and by the expectation that these databases will soon contain large amounts of high-dimensional genomic data. Current decision tree algorithms require high communication bandwidth when executed on such data in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending just a fraction of the statistical data, a fraction that is nevertheless sufficient to derive exactly the same decision tree that a sequential learner would learn from all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm exploits the high dependency among attributes, typical of genomic data, to reduce communication overhead by up to 99 percent. Scalability tests show that the algorithm scales well with the size of the data set, the dimensionality of the data, and the size of the distributed system.
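
To make the "statistical data" in the abstract concrete, the sketch below shows the naive baseline that such an algorithm improves on, not the paper's algorithm itself. Every site sends its full table of (attribute, value, class) counts, and a coordinator sums them and picks the split with the highest information gain. Because the counts are additive, the chosen split matches what a sequential learner would pick on the pooled data. The function names, the two-SNP toy data, and the choice of information gain as the split criterion are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

def local_counts(rows, n_attrs):
    """Per-site sufficient statistics: counts keyed by (attribute, value, label)."""
    counts = Counter()
    for *attrs, label in rows:
        for a in range(n_attrs):
            counts[(a, attrs[a], label)] += 1
    return counts

def entropy(label_counts):
    """Shannon entropy (in bits) of a label histogram."""
    total = sum(label_counts.values())
    return -sum((c / total) * log2(c / total) for c in label_counts.values() if c)

def best_attribute(counts, n_attrs):
    """Return (index of the attribute with highest information gain, all gains)."""
    gains = []
    for a in range(n_attrs):
        by_value, labels = {}, Counter()
        for (attr, value, label), c in counts.items():
            if attr == a:
                by_value.setdefault(value, Counter())[label] += c
                labels[label] += c
        total = sum(labels.values())
        remainder = sum(sum(vc.values()) / total * entropy(vc)
                        for vc in by_value.values())
        gains.append(entropy(labels) - remainder)
    return max(range(n_attrs), key=gains.__getitem__), gains

# Hypothetical toy data: each row is (snp0, snp1, label), split across two sites.
site_a = [(0, 0, "case"), (0, 1, "case"), (1, 0, "control"), (1, 1, "control")]
site_b = [(0, 1, "case"), (1, 0, "control"), (1, 1, "control"), (0, 0, "case")]

# Naive distributed baseline: every site ships its whole count table.
merged = sum((local_counts(site, 2) for site in (site_a, site_b)), Counter())
best, gains = best_attribute(merged, 2)
```

In this baseline the message size grows with the number of attributes, which is the communication cost the abstract says the presented algorithm reduces by sending only part of these statistics.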