Bellwether analysis: Searching for cost-effective query-defined predictors in large databases

Authors:
Bee-Chung Chen;Raghu Ramakrishnan;Jude W. Shavlik;Pradeep Tamma
Affiliations:
Yahoo! Research, Santa Clara, CA;Yahoo! Research, Santa Clara, CA;University of Wisconsin—Madison, WI;Microsoft, Redmond, WA
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2009

Abstract

Mining massive datasets is a challenging problem with great potential value. Motivated by this challenge, much effort has concentrated on developing scalable versions of machine learning algorithms. However, the cost of mining large datasets is not just computational: preparing the datasets in the “right form” so that learning algorithms can be applied is usually costly, because it typically requires human labor and involves a large number of data-preparation choices, such as selecting different subsets of the data and aggregating the data at different granularities. We make the key observation that, for a number of practically motivated problems, these choices can be defined using database queries and analyzed in an automatic and systematic manner. Specifically, we propose a new class of data-mining problems, called bellwether analysis, in which the goal is to find, among a large number of queries that define candidate predictors, a few query-defined predictors (e.g., an item's first-week sales in Peoria, IL) that can be used to accurately predict the result of a target query (e.g., the item's first-year worldwide sales). To make a prediction for a new item, the data needed to generate such predictors has to be collected (e.g., by selling the new item in Peoria, IL for a week and collecting the sales data). A useful predictor is one that has high prediction accuracy and a low data-collection cost. We call such a cost-effective predictor a bellwether. This article introduces bellwether analysis, which integrates database query processing and predictive modeling into a single framework, and provides scalable algorithms for large datasets that cannot fit in main memory. Through a series of extensive experiments, we show that bellwethers do exist in real-world databases, and that our computation techniques achieve good efficiency on large datasets.
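The selection criterion described in the abstract (high prediction accuracy at low data-collection cost) can be illustrated with a small sketch. This is not the article's algorithm: it assumes each candidate predictor has already been evaluated by its defining query over a handful of historical items, scores each candidate by the training error of a one-variable least-squares fit, and picks the most accurate candidate whose cost fits a budget. The names `Candidate`, `rmse_of_linear_fit`, and `find_bellwether`, the budget parameter, and all numbers are hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    name: str                # hypothetical label for the query defining the predictor
    cost: float              # assumed cost of collecting the data this query needs
    values: Sequence[float]  # predictor value for each historical item


def rmse_of_linear_fit(xs, ys):
    """Training RMSE of the least-squares line y = a + b*x (an assumed error measure)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    # A constant predictor carries no signal, so fall back to predicting the mean.
    b = 0.0 if sxx == 0 else sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n)


def find_bellwether(candidates, target, budget):
    """Return (candidate, error) with the lowest error among affordable candidates."""
    affordable = [c for c in candidates if c.cost <= budget]
    if not affordable:
        return None
    scored = [(rmse_of_linear_fit(c.values, target), c) for c in affordable]
    error, best = min(scored, key=lambda pair: pair[0])
    return best, error


# Toy data: target-query result (e.g., first-year worldwide sales) for four past items.
target = (100.0, 200.0, 300.0, 400.0)
candidates = [
    Candidate("week 1, Peoria, IL", cost=10.0, values=(10.0, 20.0, 30.0, 40.0)),
    Candidate("week 1, nationwide", cost=100.0, values=(50.0, 100.0, 150.0, 200.0)),
    Candidate("week 1, small town", cost=5.0, values=(3.0, 1.0, 4.0, 2.0)),
]

result = find_bellwether(candidates, target, budget=50.0)
if result is not None:
    best, error = result
    print(best.name, best.cost, round(error, 3))
```

In this toy setting the nationwide candidate predicts just as well as the Peoria one but exceeds the budget, and the cheapest candidate is too noisy, so the Peoria query is chosen. The article's setting differs in scale: the candidates are defined by database queries over data that may not fit in main memory, which is what its scalable algorithms address.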