Classification based on decision trees is one of the important problems in data mining and has applications in many fields. In recent years, database systems have become highly distributed, and distributed system paradigms, such as federated and peer-to-peer databases, are being adopted. In this paper, we consider the problem of inducing decision trees in a large distributed network of genomic databases. Our work is motivated by the existence of distributed databases in healthcare and bioinformatics, by the emergence of systems that automatically analyze these databases, and by the expectation that these databases will soon contain large amounts of high-dimensional genomic data. Current decision tree algorithms require high communication bandwidth when executed on such data in large-scale distributed systems. We present an algorithm that sharply reduces the communication overhead by sending only a fraction of the statistical data, a fraction that is nevertheless sufficient to derive exactly the same decision tree that a sequential learner would learn from all the data in the network. Extensive experiments using standard synthetic SNP data show that the algorithm exploits the high dependency among attributes, typical of genomic data, to reduce communication overhead by up to 99 percent. Scalability tests show that the algorithm scales well with the size of the data set, the dimensionality of the data, and the size of the distributed system.
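To make the exactness claim concrete, the following is a minimal sketch of the baseline it builds on, not the algorithm presented in the paper. It assumes the Gini index as the split criterion and assumes each site shares full (attribute value, class label) counts per attribute; the abstract does not specify how the communicated fraction is chosen, so that part is not shown. All names (`local_counts`, `gini_gain`, `best_attribute`, `snp1`, `snp2`) and the toy records are hypothetical.

```python
from collections import Counter

def local_counts(rows, attr):
    """Count (attribute value, class label) pairs held at one site."""
    return Counter((row[attr], row["label"]) for row in rows)

def gini(class_counts):
    """Gini impurity of a class-count table."""
    n = sum(class_counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values())

def gini_gain(counts):
    """Gini gain of splitting on an attribute, from its (value, label) counts."""
    total = Counter()
    by_value = {}
    for (value, label), c in counts.items():
        total[label] += c
        by_value.setdefault(value, Counter())[label] += c
    n = sum(total.values())
    weighted = sum(sum(cc.values()) / n * gini(cc) for cc in by_value.values())
    return gini(total) - weighted

def best_attribute(sites, attrs):
    """Choose the split attribute using only summed per-site counts."""
    gains = {}
    for attr in attrs:
        merged = Counter()
        for rows in sites:
            # Assumption: every site sends its full count table for attr.
            merged.update(local_counts(rows, attr))
        gains[attr] = gini_gain(merged)
    return max(attrs, key=lambda a: gains[a]), gains

# Hypothetical toy SNP-like records spread over two sites.
site_a = [
    {"snp1": 0, "snp2": 0, "label": "control"},
    {"snp1": 0, "snp2": 1, "label": "control"},
    {"snp1": 1, "snp2": 0, "label": "case"},
]
site_b = [
    {"snp1": 1, "snp2": 1, "label": "case"},
    {"snp1": 0, "snp2": 1, "label": "control"},
    {"snp1": 1, "snp2": 0, "label": "case"},
]
attrs = ["snp1", "snp2"]

distributed_choice, distributed_gains = best_attribute([site_a, site_b], attrs)
sequential_choice, sequential_gains = best_attribute([site_a + site_b], attrs)
print(distributed_choice, sequential_choice)
```

Because the gain depends on the data only through these counts, summing them across sites selects the same split a sequential learner would select on the pooled records. The cost of this baseline grows with the number of attributes and sites, which is the communication overhead the abstract says the presented algorithm reduces.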