Mining complex models from arbitrarily large databases in constant time

Authors:
Geoff Hulten;Pedro Domingos
Affiliations:
University of Washington, Seattle, WA;University of Washington, Seattle, WA
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Citing 12
Cited 18

Improving Generalization with Active Learning

Machine Learning - Special issue on structured connectionist systems
Learning Bayesian Networks: The Combination of Knowledge and Statistical Data

Machine Learning
PALO: a probabilistic hill-climbing algorithm

Artificial Intelligence
Efficient progressive sampling

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining high-speed data streams

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
KDD-Cup 2000 organizers' report: peeling the onion

ACM SIGKDD Explorations Newsletter - Special issue on “Scalable data mining algorithms”
Mining time-changing data streams

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Data Mining and Knowledge Discovery
Incremental Maximization of Non-Instance-Averaging Utility Functions with Applications to Knowledge Discovery Problems

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
The learning-curve sampling method applied to model-based clustering

The Journal of Machine Learning Research
Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm

UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence
On the sample complexity of learning Bayesian networks

UAI'96 Proceedings of the Twelfth international conference on Uncertainty in artificial intelligence

Discovering decision rules from numerical data streams

Proceedings of the 2004 ACM symposium on Applied computing
Tractable learning of large Bayes net structures from sparse data

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Fast discovery of unexpected patterns in data, relative to a Bayesian network

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Learning the structure of Markov logic networks

ICML '05 Proceedings of the 22nd international conference on Machine learning
Markov logic networks

Machine Learning
Graphical models of residue coupling in protein families

Proceedings of the 5th international workshop on Bioinformatics
Bayes net graphs to understand co-authorship networks?

Proceedings of the 3rd international workshop on Link discovery
Sequential update of ADtrees

ICML '06 Proceedings of the 23rd international conference on Machine learning
Info-fuzzy algorithms for mining dynamic data streams

Applied Soft Computing
Mining Arbitrarily Large Datasets Using Heuristic k-Nearest Neighbour Search

AI '08 Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
Indexing density models for incremental learning and anytime classification on data streams

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
CBDT: A Concept Based Approach to Data Stream Mining

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Harnessing the strengths of anytime algorithms for constant data streams

Data Mining and Knowledge Discovery
Democratic instance selection: A linear complexity instance selection algorithm based on classifier ensemble concepts

Artificial Intelligence
Streaming data reduction using low-memory factored representations

Information Sciences: an International Journal
Voting massive collections of Bayesian network classifiers for data streams

AI'06 Proceedings of the 19th Australian joint conference on Artificial Intelligence: advances in Artificial Intelligence
A few useful things to know about machine learning

Communications of the ACM
Monte Carlo MCMC: efficient inference by approximate sampling

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Quantified Score

Hi-index	0.02

Abstract

In this paper we propose a scaling-up method that is applicable to essentially any induction algorithm based on discrete search. The result of applying the method to an algorithm is that its running time becomes independent of the size of the database, while the decisions made are essentially identical to those that would be made given infinite data. The method works within pre-specified memory limits and, as long as the data is iid, only requires accessing it sequentially. It gives anytime results, and can be used to produce batch, stream, time-changing and active-learning versions of an algorithm. We apply the method to learning Bayesian networks, developing an algorithm that is faster than previous ones by orders of magnitude, while achieving essentially the same predictive performance. We observe these gains on a series of large databases generated from benchmark networks, on the KDD Cup 2000 e-commerce data, and on a Web log containing 100 million requests.
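The idea that a search step can be decided from a bounded sample, read sequentially, can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say which statistical test is used, so the Hoeffding bound here is an assumption, and the names `hoeffding_epsilon`, `pick_best`, `diff_range`, `delta`, and `max_examples` are hypothetical. The sketch reads i.i.d. examples one at a time, tracks the average score of each candidate, and stops as soon as the leader's margin over the runner-up exceeds the bound, so the number of examples read depends on the margin and on `delta`, not on the size of the database.

```python
import math
from typing import Callable, Iterable, Sequence, Tuple


def hoeffding_epsilon(diff_range: float, delta: float, n: int) -> float:
    """Hoeffding deviation bound for the mean of n i.i.d. values.

    diff_range is the width of the interval the values fall in.
    With probability at least 1 - delta, the sample mean does not
    exceed the true mean by more than the returned amount.
    """
    return math.sqrt(diff_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def pick_best(
    stream: Iterable[float],
    scorers: Sequence[Callable[[float], float]],
    diff_range: float = 2.0,   # assumption: scores lie in [0, 1], so differences lie in [-1, 1]
    delta: float = 1e-6,       # assumption: allowed probability of a wrong choice
    max_examples: int = 1_000_000,
) -> Tuple[int, int]:
    """Return (index of chosen candidate, number of examples read).

    Examples are consumed strictly sequentially and never revisited.
    """
    if len(scorers) < 2:
        raise ValueError("need at least two candidates to compare")

    totals = [0.0] * len(scorers)
    ranked = list(range(len(scorers)))
    n = 0
    for example in stream:
        n += 1
        for i, score in enumerate(scorers):
            totals[i] += score(example)

        # Rank candidates by accumulated score (ties keep the lower index first).
        ranked = sorted(range(len(scorers)), key=lambda i: totals[i], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        gap = (totals[best] - totals[runner_up]) / n

        # Stop once the observed margin is larger than the sampling error bound,
        # or when the example budget is exhausted (an anytime-style fallback).
        if gap > hoeffding_epsilon(diff_range, delta, n) or n >= max_examples:
            return best, n

    if n == 0:
        raise ValueError("stream was empty")
    return ranked[0], n
```

Under these assumptions, a caller passes any iterable of examples and two or more scoring functions. A search procedure would repeat this choice at every step, each time drawing fresh examples from the stream.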