The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by much preparatory work that precedes the application of data mining algorithms and by the substantial effort that follows to turn preliminary models into robust solutions. In this context, we discuss two topics. First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they are insufficient to provide an overall "big picture" of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated into production workflows; we refer to this as "plumbing". This paper has two goals. For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after us. For academic researchers, we hope to provide a broader context for data mining in production environments, pointing out opportunities for future work.