Pollux: towards scalable distributed real-time search on microblogs

Authors:
Liwei Lin; Xiaohui Yu; Nick Koudas
Affiliations:
Shandong University, Jinan, China; Shandong University, Jinan, China and York University, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada
Venue:
Proceedings of the 16th International Conference on Extending Database Technology
Year:
2013

Abstract

The last few years have witnessed a meteoric rise of microblogging platforms such as Twitter and Tumblr. The sheer volume of microblog data and its highly dynamic nature present unique technical challenges for the platforms that provide search services. In particular, the search service must respond to queries in real time and continuously update the results as new microblogs are posted. Conventional approaches either cannot keep up with the high update rate or cannot scale to handle the large volume of data. We propose Pollux, a system that provides a distributed real-time indexing and search service on microblogs. It adopts the distributed stream processing paradigm advocated by recently developed platforms designed for real-time processing of large volumes of data, such as Apache S4 and Twitter Storm. Although those open-source platforms have been applied successfully in production environments, they lack some critical features required for real-time search. In particular, (1) they implement only partial fault tolerance and do not provide lossless recovery in the event of a node failure, and (2) they have no facility for storing global data, which is necessary for ranking search results efficiently. To address these problems, Pollux extends current platforms in two important ways. First, we propose a failover strategy that ensures high system availability and no loss of data or state in the event of a node failure. Second, Pollux adds a global storage facility that supports convenient, efficient, and reliable storage of shared data. We describe how to apply Pollux to the task of real-time search. We implement Pollux on top of Apache S4 and show through extensive experiments on a Twitter dataset that the proposed solutions are effective and that Pollux achieves excellent scalability.