Inferring decision trees using the minimum description length principle
Information and Computation
C4.5: programs for machine learning
C4.5: programs for machine learning
Implementing data cubes efficiently
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Bottom-up computation of sparse and Iceberg CUBE
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Efficient computation of Iceberg cubes with complex measures
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Machine Learning
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Data Mining and Knowledge Discovery
RainForest - A Framework for Fast Decision Tree Construction of Large Datasets
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Learning Probabilistic Relational Models
IJCAI '99 Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence
Aggregation-based feature invention and relational concept classes
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
MM-Cubing: Computing Iceberg Cubes by Factorizing the Lattice Space
SSDBM '04 Proceedings of the 16th International Conference on Scientific and Statistical Database Management
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Supervised versus multiple instance learning: an empirical comparison
ICML '05 Proceedings of the 22nd international conference on Machine learning
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Multi-dimensional regression analysis of time-series data streams
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Towards keyword-driven analytical processing
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Bellwether analysis: Searching for cost-effective query-defined predictors in large databases
ACM Transactions on Knowledge Discovery from Data (TKDD)
Adversarial-knowledge dimensions in data privacy
The VLDB Journal — The International Journal on Very Large Data Bases
Hi-index | 0.01 |
Massive datasets are becoming commonplace in a wide range of domains, and mining them is recognized as a challenging problem with great potential value. Motivated by this challenge, much effort has been concentrated on developing scalable versions of machine learning algorithms. An often overlooked issue is that large datasets are rarely labeled with the outputs that we wish to learn to predict, due to the human labor required. We make the key observation that analysts can often use queries to define labels for cases, which leads to the problem of learning to predict such query-produced labels. Of course, if a dataset is available in its entirety, we can simply run the query again to compute labels. The interesting scenarios are those where, after the predictive model is trained, new data is gathered at significant incremental cost and, perhaps, over time. The challenge is to accurately predict the query-labels for the projected completion of new datasets, based only on certain cost-effective subsets, which we call bellwethers.