Mining association rules between sets of items in large databases
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Amazon.com Recommendations: Item-to-Item Collaborative Filtering
IEEE Internet Computing
The many faces of publish/subscribe
ACM Computing Surveys (CSUR)
The link prediction problem for social networks
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Understanding MySQL Internals
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Dynamo: amazon's highly available key-value store
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Efficient bulk insertion into a distributed ordered table
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Predicting tie strength with social media
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Efficient type-ahead search on relational data: a TASTIER approach
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Distributed indexing of web scale datasets for the cloud
Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
Data warehousing and analytics infrastructure at facebook
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
ZooKeeper: wait-free coordination for internet-scale systems
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Parallel bulk insertion for large-scale analytics applications
Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
Distributed indexing for semantic search
Proceedings of the 3rd International Semantic Search Workshop
The YouTube video recommendation system
Proceedings of the fourth ACM conference on Recommender systems
The Hadoop Distributed File System
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Chukwa: a system for reliable large-scale log collection
LISA'10 Proceedings of the 24th international conference on Large installation system administration
Hadoop: The Definitive Guide
Apache hadoop goes realtime at Facebook
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A batch of PNUTS: experiences connecting cloud batch and serving systems
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Distributed cube materialization on holistic measures
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Mahout in Action
Serving large-scale batch computed data with project Voldemort
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Large-scale machine learning at twitter
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
The unified logging infrastructure for data analytics at Twitter
Proceedings of the VLDB Endowment
Avatara: OLAP for web-scale analytics products
Proceedings of the VLDB Endowment
Metaphor: a system for related search recommendations
Proceedings of the 21st ACM international conference on Information and knowledge management
Hi-index | 0.00 |
The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the ``last mile'' issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.