Ricardo: integrating R and Hadoop

  • Authors: Sudipto Das (University of California, Santa Barbara, USA); Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson (IBM Almaden Research Center, San Jose, USA)
  • Venue: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data
  • Year: 2010

Abstract

Many modern enterprises are collecting data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to this data is becoming essential for marketplace competitiveness. This need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling, but can handle only limited amounts of data; e.g., popular packages like R and SPSS operate entirely in main memory. On the other hand, data management systems - such as MapReduce-based systems - can scale to petabytes of data, but provide insufficient analytical functionality. We report our experiences in building Ricardo, a scalable platform for deep analytics. Ricardo is part of the eXtreme Analytics Platform (XAP) project at the IBM Almaden Research Center, and rests on a decomposition of data-analysis algorithms into parts executed by the R statistical analysis system and parts handled by the Hadoop data management system. This decomposition attempts to minimize the transfer of data across system boundaries. Ricardo contrasts with previous approaches, which try to get along with only one type of system, and allows analysts to work on huge datasets from within a popular, well-supported, and powerful analysis environment. Because our approach avoids the need to re-implement either statistical or data-management functionality, it can be used to solve complex problems right now.
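
The decomposition described in the abstract can be made concrete with a small sketch. The R code below is only an illustration of the general idea, not Ricardo's actual interface: the jar name, HDFS paths, and output layout are assumptions made for the example, and the Hadoop aggregation job is invoked through a plain command-line call. The large-data work (computing the sufficient statistics X'X and X'y of a linear model) runs in Hadoop; only these small aggregates cross the system boundary, and R fits the model in main memory.

    # Illustrative sketch only -- not Ricardo's API. A hypothetical MapReduce
    # job (lm-stats.jar) scans the full dataset on HDFS and writes the k x k
    # matrix X'X and the k-vector X'y as small text files; R then solves the
    # normal equations entirely in main memory.

    run_hadoop <- function(cmd) stopifnot(system(cmd) == 0)

    # Big-data side: aggregate the sufficient statistics over the raw data.
    run_hadoop("hadoop jar lm-stats.jar /data/sales /tmp/lm-stats")

    # Read a small HDFS text file back into an R matrix.
    read_hdfs <- function(path) {
      as.matrix(read.table(pipe(paste("hadoop fs -cat", path))))
    }

    XtX <- read_hdfs("/tmp/lm-stats/XtX.txt")   # k x k, tiny
    Xty <- read_hdfs("/tmp/lm-stats/Xty.txt")   # k x 1, tiny

    # Small-data side: fit the linear model within R.
    beta <- solve(XtX, Xty)
    print(beta)

Only the k-by-k aggregates move between the two systems, which mirrors the design goal stated in the abstract: minimize the data transferred across the R/Hadoop boundary while reusing the existing functionality of both systems.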