Stochastic gradient boosted distributed decision trees

Authors:
Jerry Ye;Jyh-Herng Chow;Jiang Chen;Zhaohui Zheng
Affiliations:
Yahoo! Labs, Sunnyvale, CA, USA;Yahoo! Labs, Sunnyvale, CA, USA;Yahoo! Labs, Sunnyvale, CA, USA;Yahoo! Labs, Sunnyvale, CA, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

Stochastic Gradient Boosted Decision Trees (GBDT) is one of the most widely used learning algorithms in machine learning today. It is adaptable, easy to interpret, and produces highly accurate models. However, most implementations today are computationally expensive and require all training data to be in main memory. As training data becomes ever larger, there is motivation for us to parallelize the GBDT algorithm. Parallelizing decision tree training is intuitive, and various approaches have been explored in the existing literature. Stochastic boosting, on the other hand, is inherently a sequential process and has not been applied to distributed decision trees. In this work, we present two different distributed methods that generate exact stochastic GBDT models: the first is a MapReduce implementation, and the second utilizes MPI on the Hadoop grid environment.
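
As a rough illustration of the tension the abstract describes, the sketch below keeps the boosting loop sequential while splitting the work of fitting each tree across data partitions. It is not the paper's MapReduce or MPI implementation. It assumes squared-error loss, a single numeric feature, and depth-one trees (stumps), and every name in it (`partition_stats`, `best_stump`, `fit`, `predict`, `n_parts`) is hypothetical. Each partition reports only counts and residual sums per candidate threshold, and merging those sums selects the same split a single machine would choose, up to floating-point rounding.

```python
import random

def partition_stats(xs, residuals, thresholds):
    """Work done on one partition: (count, residual sum) left of each threshold."""
    left = [[0, 0.0] for _ in thresholds]
    total = [0, 0.0]
    for x, r in zip(xs, residuals):
        total[0] += 1
        total[1] += r
        for i, t in enumerate(thresholds):
            if x <= t:
                left[i][0] += 1
                left[i][1] += r
    return left, total

def best_stump(all_stats, thresholds):
    """Merge per-partition statistics and pick the best threshold."""
    n = sum(total[0] for _, total in all_stats)
    s = sum(total[1] for _, total in all_stats)
    best, best_gain = None, float("-inf")
    for i, t in enumerate(thresholds):
        n_l = sum(left[i][0] for left, _ in all_stats)
        s_l = sum(left[i][1] for left, _ in all_stats)
        n_r, s_r = n - n_l, s - s_l
        if n_l == 0 or n_r == 0:
            continue
        # Maximizing this quantity minimizes the squared error of the split.
        gain = s_l * s_l / n_l + s_r * s_r / n_r
        if gain > best_gain:
            best_gain, best = gain, (t, s_l / n_l, s_r / n_r)
    return best

def fit(xs, ys, n_trees=50, lr=0.1, subsample=0.5, n_parts=4, seed=0):
    rng = random.Random(seed)
    thresholds = sorted(set(xs))[:-1]
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_trees):  # boosting rounds stay sequential
        # Stochastic step: each round sees a random subsample of the rows.
        idx = rng.sample(range(len(xs)), max(2, int(subsample * len(xs))))
        # Simulated partitions; in a cluster each would live on another node.
        parts = [idx[p::n_parts] for p in range(n_parts)]
        stats = [
            partition_stats(
                [xs[i] for i in part],
                [ys[i] - preds[i] for i in part],  # negative gradient of squared loss
                thresholds,
            )
            for part in parts
        ]
        stump = best_stump(stats, thresholds)
        if stump is None:
            continue
        t, left_value, right_value = stump
        stumps.append(stump)
        preds = [p + lr * (left_value if x <= t else right_value)
                 for x, p in zip(xs, preds)]
    return f0, lr, stumps

def predict(model, x):
    f0, lr, stumps = model
    return f0 + sum(lr * (lv if x <= t else rv) for t, lv, rv in stumps)

if __name__ == "__main__":
    xs = list(range(20))
    ys = [0.0 if x < 10 else 1.0 for x in xs]
    model = fit(xs, ys, n_trees=100, lr=0.3)
    print(predict(model, 2), predict(model, 17))
```

Only the statistics gathering inside one round is spread across partitions here. Each round still waits for the previous one, because the residuals depend on the current predictions.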