A formal framework for database sampling

  • Authors: Jesús Bisbal; Jane Grimson; David Bell

  • Affiliations: Department of Technology, Universitat Pompeu Fabra, Passeig de Circumval·lació 8, 08003 Barcelona, Spain; Department of Computer Science, Trinity College Dublin, Dublin, Ireland; School of Computer Science, Queen's University, Belfast, United Kingdom

  • Venue: Information and Software Technology
  • Year: 2005

Abstract

Database sampling is commonly used in applications such as data mining and approximate query evaluation to achieve a compromise between the accuracy of the results and the computational cost of the process. The authors have recently proposed the use of database sampling to populate a prototype database, that is, a database used to support the development of data-intensive applications. Existing methods for constructing prototype databases commonly populate the resulting database with synthetic data values. A more realistic approach is to sample an existing database so that the resulting sample satisfies a predefined set of integrity constraints. The resulting database, with domain-relevant data values and semantics, is expected to better support the software development process. This paper presents a formal study of database sampling. A Denotational Semantics description of database sampling is first discussed. Then the paper characterises the types of integrity constraints that must be considered during sampling. Lastly, the sampling strategy presented here is applied to improve the data quality of a (legacy) database. In this context, database sampling is used to incrementally identify the set of tuples that cause inconsistencies in the database and that the data cleaning process should therefore address.
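
To make the idea of a constraint-satisfying sample concrete, the following is a minimal Python sketch. It is not the paper's formal framework or its sampling strategy. It assumes a toy schema with two in-memory tables and a single foreign-key constraint, and every name in it (`sample_with_foreign_keys`, `foreign_key_violations`, `customers`, `orders`, `customer_id`) is hypothetical. It draws a random sample of child rows and then adds every parent row those children reference, so the sample satisfies the foreign key whenever the source data does. The same checker, run against a full table, returns the tuples that violate the constraint, which loosely mirrors the data-cleaning use mentioned in the abstract.

```python
import random


def sample_with_foreign_keys(parents, children, fk, n, seed=0):
    """Sample n child rows, then add every parent row they reference.

    Assumes each parent row has an "id" key and each child row stores
    the referenced parent id under the key named by `fk`.
    """
    rng = random.Random(seed)
    sampled_children = rng.sample(children, n)
    # Collect the parent ids needed to keep the foreign key satisfied.
    needed = {row[fk] for row in sampled_children}
    sampled_parents = [p for p in parents if p["id"] in needed]
    return sampled_parents, sampled_children


def foreign_key_violations(parents, children, fk):
    """Return the child rows whose foreign key matches no parent row."""
    parent_ids = {p["id"] for p in parents}
    return [c for c in children if c[fk] not in parent_ids]


# Hypothetical toy data: 5 customers and 20 orders that reference them.
customers = [{"id": i, "name": f"customer-{i}"} for i in range(1, 6)]
orders = [{"id": 100 + i, "customer_id": 1 + i % 5} for i in range(20)]

sample_customers, sample_orders = sample_with_foreign_keys(
    customers, orders, "customer_id", n=4
)

# An order pointing at a non-existent customer is flagged by the checker.
dirty_orders = orders + [{"id": 999, "customer_id": 42}]
flagged = foreign_key_violations(customers, dirty_orders, "customer_id")
```

Sampling the child table first and then closing over its references is only one possible design. Real schemas have many interacting constraints (cardinality, uniqueness, chains of references), which is the situation the paper's characterisation of integrity constraints addresses.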