Mining massive datasets is a challenging problem with great potential value, and much effort has gone into developing scalable versions of machine learning algorithms. However, the cost of mining large datasets is not just computational. Preparing the data in the “right form” for learning algorithms is usually costly as well, because it typically requires human labor and involves a large number of choices, such as selecting different subsets of the data and aggregating data at different granularities. We make the key observation that, for a number of practically motivated problems, these choices can be defined using database queries and analyzed automatically and systematically. Specifically, we propose a new class of data-mining problems, called bellwether analysis. Given a large number of queries that define candidate predictors, the goal is to find a few query-defined predictors (e.g., first-week sales of an item in Peoria, IL) that can accurately predict the result of a target query (e.g., first-year worldwide sales of the item). To make a prediction for a new item, the data needed to generate such predictors has to be collected (e.g., by selling the new item in Peoria, IL for a week and collecting the sales data). A useful predictor is one that has high prediction accuracy and a low data-collection cost. We call such a cost-effective predictor a bellwether. This article introduces bellwether analysis, which integrates database query processing and predictive modeling into a single framework, and provides scalable algorithms for large datasets that cannot fit in main memory. Through extensive experiments, we show that bellwethers do exist in real-world databases and that our computation techniques achieve good efficiency on large datasets.
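The selection criterion described above (high accuracy at low data-collection cost) can be illustrated with a minimal in-memory sketch. This is not the article's algorithm: the names `Candidate`, `fit_line`, `rmse`, and `find_bellwether`, the one-variable linear model, the use of training error as the accuracy measure, the hard cost budget, and all numbers are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """A hypothetical query-defined predictor (e.g., one region and time window)."""
    name: str             # illustrative label for the defining query
    cost: float           # assumed cost of collecting this data for a new item
    feature: List[float]  # query result for each historical item


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y ~ a * x + b; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        # A constant feature carries no signal: predict the mean target.
        return 0.0, mean_y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def rmse(x: List[float], y: List[float]) -> float:
    """Training root-mean-square error of the fitted line (a stand-in accuracy measure)."""
    a, b = fit_line(x, y)
    return (sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y)) / len(x)) ** 0.5


def find_bellwether(candidates: List[Candidate], target: List[float],
                    budget: float) -> Optional[Candidate]:
    """Return the affordable candidate with the lowest error, or None if none is affordable."""
    best, best_err = None, float("inf")
    for cand in candidates:
        if cand.cost > budget:
            continue  # too expensive to collect for a new item
        err = rmse(cand.feature, target)
        if err < best_err:
            best, best_err = cand, err
    return best


# Made-up historical data: the target query result (e.g., first-year
# worldwide sales) for four past items, plus three candidate predictors.
target = [100.0, 200.0, 300.0, 400.0]
candidates = [
    Candidate("week 1, Peoria IL", cost=1.0, feature=[10.0, 20.0, 30.0, 40.0]),
    Candidate("week 1, USA", cost=50.0, feature=[50.0, 100.0, 150.0, 200.0]),
    Candidate("week 1, Nome AK", cost=0.5, feature=[3.0, 1.0, 4.0, 1.0]),
]

best = find_bellwether(candidates, target, budget=10.0)
print(best.name if best else "no affordable candidate")
```

In this toy data the nationwide candidate predicts the target just as well as the Peoria candidate, but it exceeds the budget, so the cheaper region is selected. A realistic setting would estimate error on held-out items and would have to evaluate far more candidate queries than fit in memory, which is the scalability problem the article addresses.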