Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that allow the use of large amounts of data. One possible approach for dealing with huge amounts of data is to take a random sample and do data mining on it, since for many data mining applications approximate answers are acceptable. However, as argued by several researchers, random sampling is hard to apply because an appropriate sample size is difficult to determine. In this paper, we take a sequential sampling approach to this difficulty and propose an adaptive sampling method that solves a general problem covering many actual problems arising in applications of discovery science. An algorithm following this method obtains examples sequentially in an on-line fashion, and it determines from the obtained examples whether it has already seen a large enough number of examples. Thus, the sample size is not fixed a priori; instead, it depends adaptively on the situation. Due to this adaptiveness, if we are not in a worst-case situation, as is fortunately the case in many practical applications, then we can solve the problem with a number of examples much smaller than that required in the worst case. We prove the correctness of our method and estimate its efficiency theoretically. To illustrate its usefulness, we consider one concrete task requiring sampling, provide an algorithm based on our method, and show its efficiency experimentally.
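To make the idea of a data-dependent stopping rule concrete, the following is a minimal sketch of sequential sampling for one simple task: estimating an unknown probability p to within relative error epsilon with confidence at least 1 - delta. It is not the algorithm of the paper. The stopping condition is a generic one built from Hoeffding's inequality and a union bound over all steps, and the names `adaptive_estimate`, `draw`, `epsilon`, `delta`, and `max_samples` are hypothetical, chosen only for illustration.

```python
import math
import random
from typing import Callable, Tuple


def adaptive_estimate(draw: Callable[[], bool], epsilon: float, delta: float,
                      max_samples: int = 1_000_000) -> Tuple[float, int]:
    """Estimate p = Pr[draw() is True] with a sample size chosen on the fly.

    Returns (estimate, number of examples used). Illustrative sketch only:
    the stopping rule is a Hoeffding-based assumption, not the paper's rule.
    """
    successes = 0
    for t in range(1, max_samples + 1):
        successes += int(draw())  # obtain one more example on-line
        p_hat = successes / t
        # Two-sided Hoeffding radius with failure budget delta / (t * (t + 1))
        # at step t; these budgets sum to at most delta over all steps.
        alpha = math.sqrt(math.log(2 * t * (t + 1) / delta) / (2 * t))
        # If |p_hat - p| <= alpha and p_hat >= alpha * (1 + 1/epsilon), then
        # alpha <= epsilon * (p_hat - alpha) <= epsilon * p, so we may stop.
        if p_hat >= alpha * (1 + 1 / epsilon):
            return p_hat, t
    # Sample budget exhausted: return the current estimate without a guarantee.
    return successes / max_samples, max_samples


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.5, 0.1):
        estimate, used = adaptive_estimate(lambda: rng.random() < p, 0.2, 0.05)
        print(f"p={p}: estimate={estimate:.4f} after {used} examples")
```

The number of examples consumed is not fixed in advance: under this rule a large p triggers the stopping condition much earlier than a small p, which mirrors the observation above that situations far from the worst case need far fewer examples.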