Sampling from databases using B$^+$-Trees

  • Authors:
  • Dimuthu Makawita; Kian-Lee Tan; Huan Liu

  • Affiliations:
  • School of ICT, Ngee Ann Polytechnic, 535 Clementi Road, Singapore 599489. Tel.: +65 460 6897 / E-mail: mdp@np.edu.sg; Department of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543. Tel.: +65 874 2862 / E-mail: tankl@comp.nus.edu.sg; Department of Computer Science & Engineering, Arizona State University, P.O. Box 875406, Tempe, AZ 85287-5406, USA. Tel.: +1 480 727 7349 / E-mail: hliu@asu.edu

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2002

Abstract

Sampling techniques are becoming increasingly important for very large databases. However, the problem of obtaining a random sample from index structures has not received much attention. In this paper, we examine sampling techniques for B$^+$-trees. Because the fanout of each node varies, a random walk through the index structure does not produce a representative sample of the data set. We propose a new technique, called B$^+$-Tree based Weighted Random Sampling (BTWRS), that adjusts the inclusion probabilities of records so that more records are extracted from leaves along the paths with higher fanouts. We evaluated our method extensively, and the results show that BTWRS improves on the existing schemes in terms of both the quality of the samples obtained and the efficiency of the sampling process. The proposed method can be readily adopted in existing commercial systems.
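
The bias that the abstract attributes to varying fanout can be illustrated with a small sketch. The toy tree, the function name `walk`, and the parameter `fmax` below are hypothetical and chosen only for illustration. The correction shown is the classical acceptance/rejection idea (continue the walk at a node of fanout f with probability f/fmax); it is not BTWRS, whose details the abstract does not give.

```python
from fractions import Fraction

# Hypothetical toy tree (an assumption for illustration only):
# a list is an internal node, a tuple is a leaf holding record keys.
# All leaves sit at the same depth, as in a B+-tree.
TREE = [
    [("a", "b"), ("c", "d", "e")],
    [("f",), ("g", "h"), ("i", "j", "k", "l")],
]

def walk(node, fmax=None, prob=Fraction(1)):
    """Exact probability that one root-to-leaf random walk returns each record.

    fmax=None: naive walk, picking uniformly among the f entries of each node.
    fmax=int:  acceptance/rejection variant; at a node with f entries the walk
               continues with probability f/fmax and is otherwise rejected.
    """
    f = len(node)
    step = prob / f                   # uniform choice among the node's entries
    if fmax is not None:
        step *= Fraction(f, fmax)     # acceptance factor at this node
    if isinstance(node, tuple):       # leaf: the entries are the records
        return {rec: step for rec in node}
    out = {}
    for child in node:
        out.update(walk(child, fmax, step))
    return out

naive = walk(TREE)
corrected = walk(TREE, fmax=4)        # 4 is the largest fanout in the toy tree

for rec in sorted(naive):
    print(rec, naive[rec], corrected[rec])
```

In this toy tree the naive walk returns record `a` with probability 1/8 but record `i` with probability 1/24, so records under low-fanout paths are over-represented. With the acceptance factor every record has the same probability, 1/64, at the cost of most walks being rejected (only 12/64 succeed), which hints at why both sample quality and sampling efficiency matter when comparing schemes.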