A formal framework for database sampling

Authors:
Jesús Bisbal; Jane Grimson; David Bell
Affiliations:
Department of Technology, Universitat Pompeu Fabra, Passeig de Circumval·lació 8, 08003 Barcelona, Spain; Department of Computer Science, Trinity College Dublin, Dublin, Ireland; School of Computer Science, Queen's University, Belfast, United Kingdom
Venue:
Information and Software Technology
Year:
2005

Abstract

Database sampling is commonly used in applications such as data mining and approximate query evaluation to trade the accuracy of the results against the computational cost of the process. The authors have recently proposed using database sampling to populate a prototype database, that is, a database used to support the development of data-intensive applications. Existing methods for constructing prototype databases commonly populate them with synthetic data values. A more realistic approach is to sample a database so that the resulting sample satisfies a predefined set of integrity constraints. The resulting database, with domain-relevant data values and semantics, is expected to better support the software development process. This paper presents a formal study of database sampling. A Denotational Semantics description of database sampling is discussed first. The paper then characterises the types of integrity constraints that must be considered during sampling. Lastly, the sampling strategy presented here is applied to improve the data quality of a (legacy) database. In this context, database sampling is used to incrementally identify the set of tuples that cause inconsistencies in the database and that the data cleaning process should therefore address.
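
The idea of a sample that satisfies integrity constraints can be illustrated with a minimal Python sketch. It is not the paper's Denotational Semantics formulation or its sampling algorithm; it only assumes a single referential-integrity (foreign key) constraint over a hypothetical two-table schema. The names `orders`, `customers`, `consistent_sample` and `violating_tuples` are invented for illustration.

```python
import random


def consistent_sample(orders, customers, k, seed=0):
    """Sample up to k order tuples and add the customer tuples they reference.

    Assumes a hypothetical foreign key orders.customer_id -> customers key.
    """
    rng = random.Random(seed)
    # Only orders whose referenced customer exists can be part of a
    # sample that satisfies the foreign-key constraint.
    valid = [o for o in orders if o["customer_id"] in customers]
    sampled_orders = rng.sample(valid, min(k, len(valid)))
    # Close the sample under the constraint: include every referenced customer.
    needed = {o["customer_id"] for o in sampled_orders}
    sampled_customers = {cid: customers[cid] for cid in needed}
    return sampled_orders, sampled_customers


def violating_tuples(orders, customers):
    """Return the order tuples that break the foreign-key constraint."""
    return [o for o in orders if o["customer_id"] not in customers]


# Hypothetical data: order 13 references a customer that does not exist.
customers = {1: {"name": "Ada"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 2},
    {"id": 12, "customer_id": 2},
    {"id": 13, "customer_id": 9},
]

sample_orders, sample_customers = consistent_sample(orders, customers, k=2)
dirty = violating_tuples(orders, customers)
```

Here `sample_orders` and `sample_customers` together form a small database in which every sampled order points at a sampled customer, while `dirty` lists the tuples a data cleaning step would need to look at first. A realistic setting would involve many constraint types, which is what the paper sets out to characterise.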