Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Authors:
Gilad Mishne; Jeff Dalton; Zhenghua Li; Aneesh Sharma; Jimmy Lin
Affiliations:
Twitter, San Francisco, CA, USA (all authors)
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Abstract

We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production today, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.