The power of sampling in knowledge discovery

Authors:
Jyrki Kivinen;Heikki Mannila
Affiliations:
University of California, Santa Cruz;University of Helsinki
Venue:
PODS '94 Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Year:
1994

Citing 11
Cited 43

A theory of the learnable

Communications of the ACM
Principles of database and knowledge-base systems, Vol. I

Principles of database and knowledge-base systems, Vol. I
On approximate truth

COLT '89 Proceedings of the second annual workshop on Computational learning theory
Practical selectivity estimation through adaptive sampling

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
On estimating the size of projections

ICDT '90 Proceedings of the third international conference on Database theory
Semantic complexity of classes of relational queries and query independent data partitioning

PODS '91 Proceedings of the tenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Statistical estimators for aggregate relational algebra queries

ACM Transactions on Database Systems (TODS)
Asymptotic conditional probabilities for first-order logic

STOC '92 Proceedings of the twenty-fourth annual ACM symposium on Theory of computing
Sequential sampling procedures for query size estimation

SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Horn clauses and database dependencies

Journal of the ACM (JACM)
Approximate Dependency Inference from Relations

ICDT '92 Proceedings of the 4th International Conference on Database Theory

Perspectives on database theory

ACM SIGACT News
An efficient and effective algorithm for density biased sampling

Proceedings of the eleventh international conference on Information and knowledge management
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules

Data Mining and Knowledge Discovery
On Issues of Instance Selection

Data Mining and Knowledge Discovery
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Data Mining and Knowledge Discovery
Discovering interesting inclusion dependencies: application to logical database tuning

Information Systems - Databases: Creation, management and utilization
Sampling Strategies for Mining in Data-Scarce Domains

Computing in Science and Engineering
Data Mining-Guest Editors' Introduction: From Serendipity to Science

Computer
Efficiently Determining the Starting Sample Size for Progressive Sampling

ECML '01 Proceedings of the 12th European Conference on Machine Learning
Sequential Sampling Algorithms: Unified Analysis and Lower Bounds

SAGA '01 Proceedings of the International Symposium on Stochastic Algorithms: Foundations and Applications
Sampling Large Databases for Association Rules

VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
An Index for the Data Size to Extract Decomposable Structures in LAD

ISAAC '01 Proceedings of the 12th International Symposium on Algorithms and Computation
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

DS '99 Proceedings of the Second International Conference on Discovery Science
Consistent database sampling as a database prototyping approach

Journal of Software Maintenance: Research and Practice
A new two-phase sampling based algorithm for discovering association rules

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Progressive Rademacher sampling

Eighteenth national conference on Artificial intelligence
A selective sampling approach to active feature selection

Artificial Intelligence
Elastic Translation Invariant Matching of Trajectories

Machine Learning
Association mining

ACM Computing Surveys (CSUR)
Indexed-based density biased sampling for clustering applications

Data & Knowledge Engineering
Optimization-based feature selection with adaptive instance sampling

Computers and Operations Research
A dip in the reservoir: maintaining sample synopses of evolving datasets

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
An approach to online optimization of heuristic coordination algorithms

Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems - Volume 2
Knowledge discovery query language (KDQL)

ICCOMP'08 Proceedings of the 12th WSEAS international conference on Computers
A divide-and-conquer recursive approach for scaling up instance selection algorithms

Data Mining and Knowledge Discovery
Estimating the confidence of conditional functional dependencies

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Noise-tolerant windowing

IJCAI'97 Proceedings of the Fifteenth international joint conference on Artificial intelligence - Volume 2
Integrative Windowing

Journal of Artificial Intelligence Research
Ambiguity-directed sampling for qualitative analysis of sparse data from spatially-distributed physical systems

IJCAI'01 Proceedings of the 17th international joint conference on Artificial intelligence - Volume 1
Empirical evidence for the usefulness of Armstrong relations in the acquisition of meaningful functional dependencies

Information Systems
A formal framework for database sampling

Information and Software Technology
Focusing solutions for data mining: analytical studies and experimental results in real-world domains

Focusing solutions for data mining: analytical studies and experimental results in real-world domains
Frequent subgraph mining on a single large graph using sampling techniques

Proceedings of the Eighth Workshop on Mining and Learning with Graphs
More efficient windowing

AAAI'97/IAAI'97 Proceedings of the fourteenth national conference on artificial intelligence and ninth conference on Innovative applications of artificial intelligence
An efficient preprocessing stage for the relationship-based clustering framework

Intelligent Data Analysis
Discovering process models with genetic algorithms using sampling

KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part I
A clustering-based data reduction for very large spatio-temporal datasets

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications - Volume Part II
Distributed genetic process mining using sampling

PaCT'11 Proceedings of the 11th international conference on Parallel computing technologies
A new hybrid clustering method for reducing very large spatio-temporal dataset

ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part I
Multi-selection of instances: A straightforward way to improve evolutionary instance selection

Applied Soft Computing
Incremental linear model trees on massive datasets: keep it simple, keep it fast

Proceedings of the 28th Annual ACM Symposium on Applied Computing
Dengue surveillance based on a computational model of spatio-temporal locality of Twitter

Proceedings of the 3rd International Web Science Conference
Adaptive stratified reservoir sampling over heterogeneous data streams

Information Systems

Abstract

We consider the problem of approximately verifying the truth of sentences of tuple relational calculus in a given relation M by considering only a random sample of M. We define two different measures for the error of a universal sentence in a relation. For a set of n universal sentences each with at most k universal quantifiers, we give upper and lower bounds for the sample sizes required for having a high probability that all the sentences with error at least $\varepsilon$ can be detected as false by considering the sample. The sample sizes are $O((\log n)/\varepsilon)$ or $O(|M|^{1-1/k} (\log n)/\varepsilon)$, depending on the error measure used. We also consider universal-existential sentences.
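
The first bound can be illustrated for the simplest case, k = 1. The sketch below is not the paper's algorithm or its constants. It assumes that the error of a sentence is the fraction of tuples of M that violate it, that tuples are drawn uniformly with replacement, and that a failure probability delta is given. The names sample_size and rejected_on_sample are hypothetical. Under these assumptions a sentence with error at least eps survives m draws with probability at most (1 - eps)^m <= exp(-eps * m), so a union bound over n sentences suggests m >= ln(n / delta) / eps, which is O((log n) / eps) for fixed delta.

```python
import math
import random


def sample_size(n_sentences, eps, delta):
    """Return an m that is sufficient for n_sentences * exp(-eps * m) <= delta.

    Assumption (not from the abstract): delta is the allowed probability
    that some sentence with error >= eps goes undetected.
    """
    return math.ceil(math.log(n_sentences / delta) / eps)


def rejected_on_sample(relation, predicates, eps, delta, rng):
    """Return the indices of the predicates violated by a sampled tuple.

    Each predicate stands for a universal sentence with one quantifier:
    "for all tuples t in the relation, predicate(t) holds".
    """
    m = sample_size(len(predicates), eps, delta)
    # Draw m tuples uniformly at random, with replacement.
    sample = [rng.choice(relation) for _ in range(m)]
    return [i for i, holds in enumerate(predicates)
            if not all(holds(t) for t in sample)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy relation: (id, parity) pairs.
    relation = [(i, i % 2) for i in range(1000)]
    predicates = [
        lambda t: t[0] >= 0,   # true for every tuple, never rejected
        lambda t: t[1] == 2,   # false for every tuple, always rejected
    ]
    print(sample_size(10, 0.1, 0.05))
    print(rejected_on_sample(relation, predicates, 0.1, 0.05, rng))
```

A sentence that is true in M is never rejected by this check, because every sampled tuple satisfies it. The only possible mistake is to miss a sentence whose error is at least eps, and the sample size is chosen to make that unlikely.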