Many modern enterprises are collecting data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to this data is becoming essential for marketplace competitiveness. This need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling, but can handle only limited amounts of data; popular packages such as R and SPSS, for example, operate entirely in main memory. On the other hand, data management systems, such as MapReduce-based systems, can scale to petabytes of data but provide insufficient analytical functionality. We report our experiences in building Ricardo, a scalable platform for deep analytics. Ricardo is part of the eXtreme Analytics Platform (XAP) project at the IBM Almaden Research Center and rests on a decomposition of data-analysis algorithms into parts executed by the R statistical analysis system and parts handled by the Hadoop data management system. This decomposition attempts to minimize the transfer of data across system boundaries. Ricardo contrasts with previous approaches, which rely on a single type of system, and allows analysts to work on huge datasets from within a popular, well-supported, and powerful analysis environment. Because our approach avoids the need to re-implement either statistical or data-management functionality, it can be used to solve complex problems right now.
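To make the decomposition concrete, the sketch below illustrates the kind of split described above for a simple latent-factor (matrix factorization) model. It is a minimal, hypothetical example written in Python for readability rather than in R and Jaql; the function hadoop_gradient_pass stands in for a large-data aggregation job that Ricardo would run on Hadoop, the driver loop stands in for the analyst's R session, and the in-memory ratings list stands in for data that would actually reside in HDFS. None of these names reflect Ricardo's actual API.

# Illustrative sketch only: hadoop_gradient_pass simulates a Hadoop-side pass over
# many rating records, and the driver loop simulates the R-side model update.
# Function names, data, and step size are hypothetical, not Ricardo's interface.
from collections import defaultdict
import random

random.seed(0)
K = 2                                              # number of latent factors (small model)
ratings = [(u, i, random.uniform(1, 5))            # (user, item, rating) triples;
           for u in range(50) for i in range(20)]  # in Ricardo this data stays in HDFS

# Small model state, kept on the "R" side of the system boundary.
U = {u: [random.gauss(0, 0.1) for _ in range(K)] for u in range(50)}
V = {i: [random.gauss(0, 0.1) for _ in range(K)] for i in range(20)}

def hadoop_gradient_pass(U, V, lam=0.05):
    """Simulate the large-data side: one pass over all ratings that returns only
    small aggregates (per-user/per-item gradients and the total squared error)."""
    gU = defaultdict(lambda: [0.0] * K)
    gV = defaultdict(lambda: [0.0] * K)
    loss = 0.0
    for u, i, r in ratings:                        # "map" over rating records
        err = sum(U[u][k] * V[i][k] for k in range(K)) - r
        loss += err * err
        for k in range(K):                         # "reduce": sum gradient contributions
            gU[u][k] += err * V[i][k] + lam * U[u][k]
            gV[i][k] += err * U[u][k] + lam * V[i][k]
    return gU, gV, loss

# Driver loop (the "R" side): only small aggregates cross the system boundary.
step = 0.01
for it in range(20):
    gU, gV, loss = hadoop_gradient_pass(U, V)
    for u in U:
        U[u] = [U[u][k] - step * gU[u][k] for k in range(K)]
    for i in V:
        V[i] = [V[i][k] - step * gV[i][k] for k in range(K)]
    print(f"iteration {it}: squared-error loss = {loss:.2f}")

In this sketch only the small factor matrices and the aggregated gradients move between the two sides, while the bulk of the data is touched only by the simulated Hadoop pass; that is the data-transfer-minimizing property the abstract describes.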