Stochastic Gradient Boosted Decision Trees (GBDT) is one of the most widely used learning algorithms in machine learning today. It is adaptable, easy to interpret, and produces highly accurate models. However, most current implementations are computationally expensive and require all training data to be in main memory. As training data grows ever larger, there is motivation to parallelize the GBDT algorithm. Parallelizing decision tree training is intuitive, and various approaches have been explored in the existing literature. Stochastic boosting, on the other hand, is inherently a sequential process and has not been applied to distributed decision trees. In this work, we present two distributed methods that generate exact stochastic GBDT models: the first is a MapReduce implementation, and the second uses MPI on the Hadoop grid environment.
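To make the sequential dependency concrete, the following is a minimal single-machine sketch of stochastic gradient boosting for squared-error regression. It is not the distributed method described above. It assumes one numeric feature and depth-one trees (stumps), and every name in it (`fit_stump`, `fit_sgbdt`, `sample_rate`, and so on) is hypothetical and chosen only for illustration. Each round fits a stump to the residuals of the current ensemble on a random subsample, so round t cannot start before round t-1 has finished. The exhaustive threshold search inside `fit_stump` is the part that is comparatively easy to spread across workers.

```python
import random


def fit_stump(xs, residuals):
    """Fit a one-split regression stump by exhaustive least-squares search."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All sampled x values are identical: fall back to a constant stump.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_sgbdt(xs, ys, n_trees=100, learning_rate=0.1, sample_rate=0.5, seed=0):
    """Stochastic gradient boosting with squared-error loss (illustrative only)."""
    rng = random.Random(seed)  # a fixed seed makes the sampled model reproducible
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    k = min(len(xs), max(2, int(sample_rate * len(xs))))
    for _ in range(n_trees):
        # Stochastic step: each round sees only a random subsample of the rows.
        idx = rng.sample(range(len(xs)), k)
        # Sequential step: residuals depend on every stump fitted so far.
        residuals = [ys[i] - preds[i] for i in idx]
        stump = fit_stump([xs[i] for i in idx], residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

With a fixed seed, two runs of `fit_sgbdt` on the same data draw the same subsamples and return the same model. This sketch only illustrates reproducibility on one machine and makes no claim about how the distributed implementations achieve exactness.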