Querying and Clustering Very Large Data Sets Using Dynamic Bucketing Approach

Authors:
Lixin Fu
Venue:
WAIM '02 Proceedings of the Third International Conference on Advances in Web-Age Information Management
Year:
2002


Abstract

In this paper we present dynamic bucketing, a new and efficient approach to querying and clustering very large data sets, a problem of great theoretical and practical significance. We partition the data into equal-width buckets and, as needed, further partition dense buckets into sub-buckets, efficiently reclaiming and allocating memory space. The bucketing process adapts dynamically to the order and distribution of the input data. We propose a new data structure, the structure tree, for storing aggregate information such as histograms of the buckets and sub-buckets. Because only count values are stored, the structure is highly aggregated and concise, which makes it well suited to data sets with a large number of records. We also provide new query evaluation and data clustering algorithms based on structure trees. Our simulation results show that our approach is superior to the current leading approaches in both accuracy and performance.
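
The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea it describes: equal-width buckets, dense buckets split into sub-buckets, and a tree that stores counts rather than observations. It is not the paper's algorithm. The names `Node`, `BucketTree`, `fanout`, `split_threshold`, and `max_depth` are hypothetical, memory reclamation is not modelled, and the handling of values counted before a split (treated as uniformly spread over the parent bucket) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    lo: float
    hi: float
    count: int = 0  # values seen in [lo, hi), including those seen before a split
    children: List["Node"] = field(default_factory=list)


class BucketTree:
    """Count-only tree of equal-width buckets (illustrative, not the paper's design)."""

    def __init__(self, lo, hi, fanout=4, split_threshold=8, max_depth=4):
        self.fanout = fanout
        self.split_threshold = split_threshold
        self.max_depth = max_depth
        self.root = Node(lo, hi)
        self._split(self.root)  # top-level equal-width buckets

    def _split(self, node):
        width = (node.hi - node.lo) / self.fanout
        node.children = [
            Node(node.lo + i * width, node.lo + (i + 1) * width)
            for i in range(self.fanout)
        ]
        node.children[-1].hi = node.hi  # guard against floating-point drift

    def _child_index(self, node, x):
        width = (node.hi - node.lo) / self.fanout
        return min(int((x - node.lo) / width), self.fanout - 1)

    def insert(self, x):
        if not (self.root.lo <= x < self.root.hi):
            raise ValueError("value outside the bucketed range")
        node, depth = self.root, 0
        node.count += 1
        while node.children:
            node = node.children[self._child_index(node, x)]
            depth += 1
            node.count += 1
        # Assumption: a leaf counts as "dense" once it exceeds split_threshold.
        if node.count > self.split_threshold and depth < self.max_depth:
            self._split(node)

    def count_below(self, x):
        """Estimate how many inserted values are < x."""
        return self._below(self.root, x)

    def _below(self, node, x):
        if x <= node.lo:
            return 0.0
        if x >= node.hi:
            return float(node.count)
        # Values counted before the split cannot be placed in a sub-bucket,
        # so they are assumed uniform over the whole node (an assumption).
        residual = node.count - sum(c.count for c in node.children)
        estimate = residual * (x - node.lo) / (node.hi - node.lo)
        return estimate + sum(self._below(c, x) for c in node.children)


tree = BucketTree(0.0, 100.0)
for v in range(100):
    tree.insert(float(v))
print(tree.count_below(50.0))
```

Because only counts are kept, a range count over `[a, b)` can be estimated as `tree.count_below(b) - tree.count_below(a)`, and an approximate quantile could be found by searching for the `x` at which `count_below(x)` reaches the target rank. Estimates are exact at bucket boundaries and interpolated inside a bucket.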