Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state-of-the-art tree learning algorithms require the training data to reside in memory on a single machine. More scalable implementations of tree learning have been proposed, but they typically require specialized parallel computing architectures, whereas the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET, a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real-world learning task from the domain of computational advertising.
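To make the idea of tree learning as a MapReduce computation concrete, the following is a minimal, single-process sketch of one step: choosing the best split of a single numeric feature for a regression tree node. It is not PLANET's implementation. The function names, the use of squared-error reduction as the split criterion, and the fixed list of candidate thresholds are all assumptions made for illustration. Each mapper sees only its own partition and emits sufficient statistics (count, sum, and sum of squares of the labels); the reducer adds them up and scores every candidate.

```python
from collections import defaultdict


def map_partition(rows, thresholds):
    """Mapper (hypothetical): emit sufficient statistics per candidate split.

    rows is a list of (x, y) pairs held by one worker; thresholds is the
    list of candidate split points on x. For each threshold the mapper
    yields (threshold, (n_left, sum_left, sumsq_left, n, sum, sumsq)).
    """
    for t in thresholds:
        n_left = sum_left = sq_left = 0.0
        n = total = sq = 0.0
        for x, y in rows:
            n += 1
            total += y
            sq += y * y
            if x < t:
                n_left += 1
                sum_left += y
                sq_left += y * y
        yield t, (n_left, sum_left, sq_left, n, total, sq)


def reduce_stats(pairs):
    """Reducer (hypothetical): add up the statistics emitted for each key."""
    totals = defaultdict(lambda: [0.0] * 6)
    for key, stats in pairs:
        acc = totals[key]
        for i, value in enumerate(stats):
            acc[i] += value
    return totals


def sse(n, s, q):
    """Sum of squared errors around the mean, from count, sum, sum of squares."""
    return q - s * s / n if n > 0 else 0.0


def find_best_split(partitions, thresholds):
    """Run the map and reduce phases in-process and pick the best threshold.

    Returns (threshold, squared_error_reduction), or None if no threshold
    separates the data into two non-empty sides.
    """
    emitted = []
    for rows in partitions:  # each partition stands in for one mapper
        emitted.extend(map_partition(rows, thresholds))
    totals = reduce_stats(emitted)

    best = None
    for t, (n_l, s_l, q_l, n, s, q) in totals.items():
        n_r, s_r, q_r = n - n_l, s - s_l, q - q_l
        if n_l == 0 or n_r == 0:
            continue  # a split must leave examples on both sides
        gain = sse(n, s, q) - sse(n_l, s_l, q_l) - sse(n_r, s_r, q_r)
        if best is None or gain > best[1]:
            best = (t, gain)
    return best


if __name__ == "__main__":
    # Two partitions, as if stored on two different machines.
    partitions = [[(1, 1.0), (2, 1.0)], [(3, 5.0), (4, 5.0)]]
    print(find_best_split(partitions, [1.5, 2.5, 3.5]))
```

Because the statistics are additive, no mapper ever needs the full dataset in memory, which is the property the abstract is after. A real cluster job would also have to handle many features, many tree nodes per pass, and categorical attributes, none of which this sketch attempts.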