A Survey of Methods for Scaling Up Inductive Algorithms

Authors:
Foster Provost; Venkateswarlu Kolluri
Affiliations:
Bell Atlantic Science and Technology, 500 Westchester Avenue, White Plains, New York 10604. provost@acm.org; Department of Information Science, University of Pittsburgh, Pittsburgh, PA 15260, and Lycos, Inc., 5001 Centre Avenue, Pittsburgh, PA 15213. venkat@sis.pitt.edu
Venue:
Data Mining and Knowledge Discovery
Year:
1999

Abstract

One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms. We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining. We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers. Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research.