Word association norms, mutual information, and lexicography
Computational Linguistics
Foundations of statistical natural language processing
Foundations of statistical natural language processing
Funnel report mining for the MSN network
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
Accurate methods for the statistics of surprise and coincidence
Computational Linguistics - Special issue on using large corpora: I
A generative constituent-context model for improved grammar induction
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Speech and Language Processing (2nd Edition)
Speech and Language Processing (2nd Edition)
Modeling Online Browsing and Path Analysis Using Clickstream Data
Marketing Science
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Practical guide to controlled experiments on the web: listen to your customers not to the hippo
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Modeling actions of PubMed users with n-gram language models
Information Retrieval
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Proceedings of the VLDB Endowment
Data warehousing and analytics infrastructure at facebook
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
ZooKeeper: wait-free coordination for internet-scale systems
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)
Proceedings of the VLDB Endowment
LifeFlow: visualizing an overview of event sequences (video preview)
CHI '11 Extended Abstracts on Human Factors in Computing Systems
Llama: leveraging columnar storage for scalable join processing in the MapReduce framework
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Apache hadoop goes realtime at Facebook
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient processing of data warehousing queries in a split execution environment
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Full-text indexing for optimizing selection operations in large-scale data analytics
Proceedings of the second international workshop on MapReduce and its applications
Distributed cube materialization on holistic measures
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Trojan data layouts: right shoes for a running elephant
Proceedings of the 2nd ACM Symposium on Cloud Computing
Large-scale machine learning at twitter
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
The big data ecosystem at LinkedIn
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Fast data in the era of big data: Twitter's real-time related query suggestion architecture
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Scaling big data mining infrastructure: the twitter experience
ACM SIGKDD Explorations Newsletter
HadoopProv: towards provenance as a first class citizen in MapReduce
TaPP'13 Proceedings of the 5th USENIX conference on Theory and Practice of Provenance
HadoopProv: towards provenance as a first class citizen in MapReduce
Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance
WTF: the who to follow service at Twitter
Proceedings of the 22nd international conference on World Wide Web
CG_Hadoop: computational geometry in MapReduce
Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Hi-index | 0.00 |
In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are collected and structured to begin with. In this paper, we present Twitter's production logging infrastructure and its evolution from application-specific logging to a unified "client events" log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize "session sequences", which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.