Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Authors:
Carlos Domingo;Ricard Gavaldà;Osamu Watanabe
Affiliations:
Department of Mathematics and Computer Sciences, Tokyo Institute of Technology, Tokyo, Japan;Department of LSI, Universitat Politècnica de Catalunya, Barcelona, Spain;Department of Mathematics and Computer Sciences, Tokyo Institute of Technology, Tokyo, Japan
Venue:
Data Mining and Knowledge Discovery
Year:
2002

Abstract

Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that make it possible to use large amounts of data. One possible approach for dealing with huge amounts of data is to take a random sample and do data mining on it, since for many data mining applications approximate answers are acceptable. However, as argued by several researchers, random sampling is difficult to use because it is hard to determine an appropriate sample size. In this paper, we take a sequential sampling approach to this difficulty and propose an adaptive sampling method that solves a general problem covering many actual problems arising in applications of discovery science. An algorithm following this method obtains examples sequentially in an on-line fashion, and it determines from the obtained examples whether it has already seen a large enough number of examples. Thus, the sample size is not fixed a priori; instead, it adaptively depends on the situation. Due to this adaptiveness, if we are not in a worst-case situation, as fortunately happens in many practical applications, then we can solve the problem with a number of examples much smaller than required in the worst case. We prove the correctness of our method and estimate its efficiency theoretically. To illustrate its usefulness, we consider one concrete task requiring sampling, provide an algorithm based on our method, and show its efficiency experimentally.
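
The idea of letting the data decide when to stop can be illustrated with a minimal sketch. It is not the algorithm from the paper, whose details the abstract does not give. It is a generic sequential stopping rule built on Hoeffding's inequality with a union bound over steps, and it decides whether the mean of a stream of 0/1 examples lies above or below a threshold. The function name `adaptive_sign_test` and the parameters `draw`, `threshold`, `delta`, and `max_samples` are hypothetical names chosen for this illustration.

```python
import math
import random


def adaptive_sign_test(draw, threshold=0.5, delta=0.05, max_samples=1_000_000):
    """Draw 0/1 examples one at a time until the running mean is
    confidently above or below `threshold`.

    Illustrative only (assumed setup, not the paper's method).
    Returns (decision, n_used); decision is "above", "below", or
    "undecided" if `max_samples` examples were not enough.
    """
    total = 0
    for t in range(1, max_samples + 1):
        total += draw()
        mean = total / t
        # Hoeffding radius with a union bound over all steps: step t gets
        # failure budget delta / (t * (t + 1)), and these budgets sum to delta.
        radius = math.sqrt(math.log(2 * t * (t + 1) / delta) / (2 * t))
        if mean - radius > threshold:
            return "above", t
        if mean + radius < threshold:
            return "below", t
    return "undecided", max_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # A mean far from the threshold should need far fewer examples
    # than a mean close to it; the sample size is not fixed in advance.
    for p in (0.9, 0.6, 0.52):
        decision, n = adaptive_sign_test(lambda: 1 if rng.random() < p else 0)
        print(f"p={p}: {decision} after {n} examples")
```

In this sketch the number of examples used grows as the true mean approaches the threshold, which mirrors the abstract's point that an easy instance can be settled with far fewer examples than the worst case requires.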