Composite subset measures

Authors:
Lei Chen;Raghu Ramakrishnan;Paul Barford;Bee-Chung Chen;Vinod Yegneswaran
Affiliations:
Computer Sciences Department, University of Wisconsin, Madison, WI;Computer Sciences Department, University of Wisconsin, Madison, WI and Yahoo! Research, Santa Clara, CA;Computer Sciences Department, University of Wisconsin, Madison, WI;Computer Sciences Department, University of Wisconsin, Madison, WI;Computer Sciences Department, University of Wisconsin, Madison, WI
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006

Abstract

Measures are numeric summaries of a collection of data records produced by applying aggregation functions. Summarizing a collection of subsets of a large dataset, by computing a measure for each subset in the (typically, user-specified) collection is a fundamental problem. The multidimensional data model, which treats records as points in a space defined by dimension attributes, offers a natural space of data subsets to be considered as summarization candidates, and traditional SQL and OLAP constructs, such as GROUP BY and CUBE, allow us to compute measures for subsets drawn from this space. However, GROUP BY only allows us to summarize a limited collection of subsets, and CUBE summarizes all subsets in this space. Further, they restrict the measure used to summarize a data subset to be a one-step aggregation, using functions such as SUM, of field-values in the data records.

In this paper, we introduce composite subset measures, computed by aggregating not only data records but also the measures of other related subsets. We allow summarization of naturally related regions in the multidimensional space, offering more flexibility than either GROUP BY or CUBE in the choice of what data subsets to summarize. Thus, our framework allows more meaningful summaries to be computed for a targeted collection of data subsets.

We propose an algebra called AW-RA and an equivalent pictorial language called aggregation workflows. Aggregation workflows allow for intuitive expression of composite measure queries, and the underlying algebra is designed to facilitate efficient multiscan execution. We describe an evaluation framework based on multiple passes of sorting and scanning over the original dataset. In each pass, several measures are evaluated simultaneously, and dependencies between these measures and containment relationships between the underlying subsets of data are orchestrated to reduce the memory footprint of the computation. We present a performance evaluation that demonstrates the benefits of our approach.
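
To make the idea of a composite subset measure concrete, here is a minimal Python sketch. It is not AW-RA, an aggregation workflow, or the sort/scan evaluation framework described in the abstract. The records, the field names (hour, source, packets), the threshold, and both function names are assumptions invented for illustration. The first step is an ordinary one-step aggregation over data records; the second step aggregates the measures of the finer subsets rather than the records themselves, which is what makes the result composite.

```python
from collections import defaultdict

# Hypothetical records: (hour, source, packets). Not taken from the paper.
records = [
    (0, "a", 5),
    (0, "a", 7),
    (0, "b", 1),
    (1, "a", 2),
    (1, "b", 4),
    (1, "c", 9),
]


def basic_measure(rows):
    """One-step aggregation: SUM(packets) for each (hour, source) subset."""
    totals = defaultdict(int)
    for hour, source, packets in rows:
        totals[(hour, source)] += packets
    return dict(totals)


def composite_measure(per_source, threshold):
    """Aggregate the measures of finer subsets: for each hour, count the
    sources whose packet total exceeds the threshold."""
    heavy = defaultdict(int)
    for (hour, _source), total in per_source.items():
        heavy[hour] += 1 if total > threshold else 0
    return dict(heavy)


per_source = basic_measure(records)
heavy_sources = composite_measure(per_source, threshold=3)
print(per_source)
print(heavy_sources)
```

The per-hour result depends on the per-(hour, source) totals, so it cannot be obtained by a single application of SUM or COUNT to the raw records. The sketch keeps every intermediate total in memory, which is the cost the multi-pass evaluation described in the abstract is meant to reduce.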