The unified logging infrastructure for data analytics at Twitter

Authors:
George Lee;Jimmy Lin;Chuang Liu;Andrew Lorek;Dmitriy Ryaboy
Affiliations:
Twitter, Inc.;Twitter, Inc.;Twitter, Inc.;Twitter, Inc.;Twitter, Inc.
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Citing 26
Cited 7

Word association norms, mutual information, and lexicography

Computational Linguistics
Foundations of statistical natural language processing

Foundations of statistical natural language processing
Funnel report mining for the MSN network

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Mining Sequential Patterns

ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
Accurate methods for the statistics of surprise and coincidence

Computational Linguistics - Special issue on using large corpora: I
A generative constituent-context model for improved grammar induction

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Speech and Language Processing (2nd Edition)

Speech and Language Processing (2nd Edition)
Modeling Online Browsing and Path Analysis Using Clickstream Data

Marketing Science
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Practical guide to controlled experiments on the web: listen to your customers not to the hippo

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Modeling actions of PubMed users with n-gram language models

Information Retrieval
Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
Data warehousing and analytics infrastructure at facebook

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
ZooKeeper: wait-free coordination for internet-scale systems

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
LifeFlow: visualizing an overview of event sequences (video preview)

CHI '11 Extended Abstracts on Human Factors in Computing Systems
Llama: leveraging columnar storage for scalable join processing in the MapReduce framework

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Apache hadoop goes realtime at Facebook

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient processing of data warehousing queries in a split execution environment

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Full-text indexing for optimizing selection operations in large-scale data analytics

Proceedings of the second international workshop on MapReduce and its applications
Distributed cube materialization on holistic measures

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Trojan data layouts: right shoes for a running elephant

Proceedings of the 2nd ACM Symposium on Cloud Computing
Large-scale machine learning at twitter

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

The big data ecosystem at LinkedIn

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Scaling big data mining infrastructure: the twitter experience

ACM SIGKDD Explorations Newsletter
HadoopProv: towards provenance as a first class citizen in MapReduce

TaPP'13 Proceedings of the 5th USENIX conference on Theory and Practice of Provenance
HadoopProv: towards provenance as a first class citizen in MapReduce

Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance
WTF: the who to follow service at Twitter

Proceedings of the 22nd international conference on World Wide Web
CG_Hadoop: computational geometry in MapReduce

Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are collected and structured to begin with. In this paper, we present Twitter's production logging infrastructure and its evolution from application-specific logging to a unified "client events" log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize "session sequences", which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.