Efficient progressive sampling
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Density biased sampling: an improved method for data mining and clustering
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Sampling Large Databases for Association Rules
VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Building Consistent Sample Databases to Support Information System Evolution and Migration
DEXA '98 Proceedings of the 9th International Conference on Database and Expert Systems Applications
Consistent database sampling as a database prototyping approach
Journal of Software Maintenance: Research and Practice
Evaluation of Sampling for Data Mining of Association Rules
Evaluation of Sampling for Data Mining of Association Rules
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Analysis of sampling techniques for association rule mining
Proceedings of the 12th International Conference on Database Theory
A formal framework for database sampling
Information and Software Technology
Sampling dirty data for matching attributes
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Hi-index | 0.00 |
Managing large amounts of information is one of the most expensive, time-consuming and non-trivial activities and it usually requires expert knowledge. In a wide range of application areas, such as data mining, histogram construction, approximate query evaluation, and software validation, handling exponentially growing databases has become a difficult challenge, and a subset of the data is generally preferred. As a solution to the current challenges in managing large amounts of data, database sampling from the operational data available has proved to be a powerful technique. However, none of the existing sampling approaches consider the dependencies between the data in a relational database. In this paper, we propose a novel approach towards constructing a realistic testing environment, by analyzing the distribution of data in the original database along these dependencies before sampling, so that the sample database is representative to the original database.