Unexpected challenges in large scale machine learning

Authors:
Charles Parker
Affiliations:
BigML, Inc., Corvallis, OR
Venue:
Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
Year:
2012




Abstract

In machine learning, scale adds complexity. The most obvious consequence of scale is that data takes longer to process. At certain points, however, scale makes trivial operations costly, forcing us to re-evaluate algorithms in light of the complexity of those operations. Here, we discuss one important way in which a general large-scale machine learning setting may differ from the standard supervised classification setting, and we show the results of some preliminary experiments highlighting this difference. The results suggest that there is potential for significant improvement beyond obvious solutions.