Learning to Predict Salient Regions from Disjoint and Skewed Training Sets
ICTAI '06 Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence
We describe an ensemble approach to learning salient regions from arbitrarily partitioned data. The partitioning arises from the distributed processing requirements of large-scale simulations: the data volume is such that classifiers can train only on the data local to a given partition. Because the partition reflects the needs of the simulation rather than the learning task, class statistics can vary from partition to partition, and some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false-positive errors, and since we are essentially building a search engine for simulation data, we order the predicted regions so that most of the top-ranked predictions are likely to be correct (salient). Results from simulation runs of a canister being torn and of a casing being dropped show that regions of interest are successfully identified despite the class imbalance in the individual training sets. Lift-curve analysis shows that data-driven ordering methods provide a statistically significant improvement over the default, natural time-step ordering. This saves the end user significant time by focusing attention on areas of interest without an exhaustive search of all the data.
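The abstract's "scaled probabilistic majority voting" over classifiers trained on disjoint partitions can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and the scaling scheme assumed here simply expands each partition-local probability vector to the full class set (zero for classes that partition never saw) and rescales the pooled vote to a distribution.

```python
import numpy as np

def scaled_probabilistic_vote(votes, n_classes):
    """Combine probability votes from classifiers trained on disjoint partitions.

    votes: list of (classes_seen, probs) pairs, one per partition's classifier;
    classes_seen holds the class indices that classifier was trained on, and
    probs the matching probability estimates. Each vote is expanded to the
    full class set (zero for unseen classes) before pooling, so classifiers
    that never saw a class neither support nor penalize it.
    This scaling is an illustrative assumption, not the paper's exact scheme.
    """
    total = np.zeros(n_classes)
    for classes_seen, probs in votes:
        expanded = np.zeros(n_classes)
        expanded[np.asarray(classes_seen)] = probs
        total += expanded
    total /= total.sum()  # rescale the pooled vote back to a probability distribution
    return int(np.argmax(total)), total
```

For example, three classifiers that each saw only two of three classes can still produce a pooled distribution over all three, with the missing class contributing zero from the classifiers that never observed it.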
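The lift-curve comparison between a data-driven ordering and the natural time-step ordering rests on a simple quantity: how much more salient the top-k presented regions are than the overall salience rate. A hedged sketch of that measure, with a hypothetical helper name not taken from the paper:

```python
def lift_at_k(labels_in_order, k):
    """Lift of the top-k presented regions over the overall salience rate.

    labels_in_order: 1 for a truly salient region, 0 otherwise, listed in
    the order the regions are shown to the user (e.g. score-ordered by the
    ensemble, or the default natural time-step order). A lift above 1 means
    the ordering concentrates salient regions near the top.
    """
    base_rate = sum(labels_in_order) / len(labels_in_order)
    return (sum(labels_in_order[:k]) / k) / base_rate
```

Comparing this value for the score-based ordering against the time-step ordering at several cutoffs k is what a lift-curve analysis of the two presentation orders amounts to.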