The big data ecosystem at LinkedIn

Authors:
Roshan Sumbaly; Jay Kreps; Sam Shah
Affiliations:
LinkedIn, Mountain View, CA, USA; LinkedIn, Mountain View, CA, USA; LinkedIn, Mountain View, CA, USA
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013


Abstract

The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.