Bellwether analysis: predicting global aggregates from local regions

Authors:
Bee-Chung Chen;Raghu Ramakrishnan;Jude W. Shavlik;Pradeep Tamma
Affiliations:
University of Wisconsin, Madison;University of Wisconsin, Madison and Yahoo! Research, Santa Clara, CA;University of Wisconsin, Madison;University of Wisconsin, Madison
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006

References (15)

Inferring decision trees using the minimum description length principle
Information and Computation

C4.5: programs for machine learning
C4.5: programs for machine learning

Implementing data cubes efficiently
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data

Bottom-up computation of sparse and Iceberg CUBE
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data

Efficient computation of Iceberg cubes with complex measures
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Machine Learning
Machine Learning

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Data Mining and Knowledge Discovery

RainForest - A Framework for Fast Decision Tree Construction of Large Datasets
VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases

Learning Probabilistic Relational Models
IJCAI '99 Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence

Aggregation-based feature invention and relational concept classes
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

MM-Cubing: Computing Iceberg Cubes by Factorizing the Lattice Space
SSDBM '04 Proceedings of the 16th International Conference on Scientific and Statistical Database Management

Prediction cubes
VLDB '05 Proceedings of the 31st international conference on Very large data bases

Supervised versus multiple instance learning: an empirical comparison
ICML '05 Proceedings of the 22nd international conference on Machine learning

Composite subset measures
VLDB '06 Proceedings of the 32nd international conference on Very large data bases

Multi-dimensional regression analysis of time-series data streams
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Cited by (3)

Towards keyword-driven analytical processing
Proceedings of the 2007 ACM SIGMOD international conference on Management of data

Bellwether analysis: Searching for cost-effective query-defined predictors in large databases
ACM Transactions on Knowledge Discovery from Data (TKDD)

Adversarial-knowledge dimensions in data privacy
The VLDB Journal — The International Journal on Very Large Data Bases

Abstract

Massive datasets are becoming commonplace in a wide range of domains, and mining them is recognized as a challenging problem with great potential value. Motivated by this challenge, much effort has been concentrated on developing scalable versions of machine learning algorithms. An often overlooked issue is that large datasets are rarely labeled with the outputs that we wish to learn to predict, due to the human labor required. We make the key observation that analysts can often use queries to define labels for cases, which leads to the problem of learning to predict such query-produced labels. Of course, if a dataset is available in its entirety, we can simply run the query again to compute labels. The interesting scenarios are those where, after the predictive model is trained, new data is gathered at significant incremental cost and, perhaps, over time. The challenge is to accurately predict the query-labels for the projected completion of new datasets, based only on certain cost-effective subsets, which we call bellwethers.
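
To make the idea of a bellwether concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a toy setting in which each item has an early measurement in every region and a query-produced global label, each region has a data-collection cost, and a one-feature linear model stands in for the predictive model. All names (`find_bellwether`, `regional`, `target`, `cost`, `budget`) and all numbers are hypothetical. The sketch picks the affordable region whose local values predict the global label with the lowest leave-one-out error.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Constant feature: fall back to predicting the mean label.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def loo_rmse(xs, ys):
    """Leave-one-out RMSE of the one-feature linear model."""
    errs = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append((a + b * xs[i] - ys[i]) ** 2)
    return mean(errs) ** 0.5


def find_bellwether(regional, target, cost, budget):
    """Return (region, error) with the lowest error among regions
    whose cost fits the budget, or None if no region is affordable."""
    best = None
    for region, xs in regional.items():
        if cost[region] > budget:
            continue
        err = loo_rmse(xs, target)
        if best is None or err < best[1]:
            best = (region, err)
    return best


# Hypothetical toy data: five items, three candidate regions.
regional = {
    "WI": [10, 20, 30, 40, 50],
    "CA": [5, 50, 12, 70, 9],
    "NY": [11, 19, 31, 41, 49],
}
target = [100, 200, 300, 400, 500]  # query-produced global labels (assumed)
cost = {"WI": 1.0, "CA": 3.0, "NY": 5.0}

print(find_bellwether(regional, target, cost, budget=4.0))
```

Here the budget excludes one region outright, and the search then prefers the region whose values track the global label most closely. A realistic search would range over a much larger space of candidate subsets than this exhaustive loop can handle.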