Towards realistic sampling: generating dependencies in a relational database

Authors:
Teodora Sandra Buda;John Murphy;Morten Kristiansen
Affiliations:
University College Dublin;University College Dublin;IBM Software Group, Dublin, Ireland
Venue:
Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
Year:
2013

Abstract

Managing large amounts of information is expensive, time-consuming, and non-trivial, and it usually requires expert knowledge. In a wide range of application areas, such as data mining, histogram construction, approximate query evaluation, and software validation, handling exponentially growing databases has become a difficult challenge, and working on a subset of the data is generally preferred. Database sampling from the available operational data has proved to be a powerful technique for addressing this challenge. However, none of the existing sampling approaches considers the dependencies between the data in a relational database. In this paper, we propose a novel approach towards constructing a realistic testing environment: we analyze the distribution of data in the original database along these dependencies before sampling, so that the sample database is representative of the original database.
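
To make the idea of sampling along dependencies concrete, the following is a minimal Python sketch. It is not the algorithm from the paper, which the abstract does not specify. It assumes a single foreign-key dependency between a hypothetical parent table and child table, measures how many child rows reference each parent (the fan-out), and then samples parents within each fan-out group so that the sample keeps roughly the same fan-out distribution and contains no dangling references. The names `fanout`, `sample_along_dependency`, `fk`, and `rate` are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def fanout(parents, children, fk):
    """Map each parent id to the number of child rows that reference it."""
    counts = Counter(row[fk] for row in children)
    return {pid: counts.get(pid, 0) for pid in parents}


def sample_along_dependency(parents, children, fk, rate, seed=0):
    """Sample parents stratified by fan-out and keep all of their children.

    parents  : list of parent ids (assumed unique)
    children : list of dicts, each holding the foreign key under `fk`
    rate     : fraction of parents to keep in each fan-out group (0 < rate <= 1)
    """
    rng = random.Random(seed)

    # Group parent ids by how many children reference them.
    strata = defaultdict(list)
    for pid, n in fanout(parents, children, fk).items():
        strata[n].append(pid)

    # Draw the same fraction from every group, keeping at least one parent
    # per group so that rare fan-out values are not lost.
    chosen = set()
    for _, pids in sorted(strata.items()):
        k = max(1, round(rate * len(pids)))
        chosen.update(rng.sample(pids, k))

    # Keep every child of a chosen parent: each sampled child then has its
    # parent in the sample, and each sampled parent keeps its fan-out.
    sampled_children = [row for row in children if row[fk] in chosen]
    return sorted(chosen), sampled_children


if __name__ == "__main__":
    # Hypothetical data: 100 parents, parent p has (p % 4) children.
    parents = list(range(100))
    children = [
        {"id": f"{p}-{j}", "parent_id": p}
        for p in parents
        for j in range(p % 4)
    ]
    sp, sc = sample_along_dependency(parents, children, "parent_id", rate=0.2)
    print(len(sp), "parents,", len(sc), "children")
    print(Counter(fanout(sp, sc, "parent_id").values()))
```

A plain uniform sample of child rows would instead tend to leave references to parents that were not sampled and would distort the fan-out distribution, which is the kind of dependency loss the abstract refers to. A real schema has many tables and chained dependencies, which this sketch does not attempt to handle.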