Large-scale incremental processing using distributed transactions and notifications

Authors:
Daniel Peng;Frank Dabek
Affiliations:
Google, Inc.;Google, Inc.
Venue:
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Year:
2010

Citing 22
Cited 65

Parallel database systems: the future of database processing or a passing fad?

ACM SIGMOD Record - Directions for future database research & development
A critique of ANSI SQL isolation levels

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Active database systems

ACM Computing Surveys (CSUR)
Concurrency Control in Distributed Database Systems

ACM Computing Surveys (CSUR)
Operating system support for database management

Communications of the ACM
Prototyping Bubba, A Highly Parallel Database System

IEEE Transactions on Knowledge and Data Engineering
The Gamma Database Machine Project

IEEE Transactions on Knowledge and Data Engineering
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
The Chubby lock service for loosely-coupled distributed systems

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Sinfonia: a new paradigm for building scalable distributed systems

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Dynamo: Amazon's highly available key-value store

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Bigtable: a distributed storage system for structured data

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
PNUTS: Yahoo!'s hosted data serving platform

Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Stateful bulk processing for incremental analytics

Proceedings of the 1st ACM symposium on Cloud computing
Twister: a runtime for iterative MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
ElasTraS: an elastic transactional data store in the cloud

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
DryadInc: reusing work in large-scale computations

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Spark: cluster computing with working sets

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing

Nova: continuous Pig/Hadoop workflows

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A distributed resource management architecture for interconnecting Web-of-Things using uBox

Proceedings of the Second International Workshop on Web of Things
Full-text indexing for optimizing selection operations in large-scale data analytics

Proceedings of the second international workshop on MapReduce and its applications
Mining large distributed log data in near real time

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Incoop: MapReduce for incremental computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
PrIter: a distributed framework for prioritized iterative computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
Transactional storage for geo-replicated systems

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Incremental recomputations in MapReduce

Proceedings of the third international workshop on Cloud data management
Large-scale continuous subgraph queries on streams

Proceedings of the first annual workshop on High performance computing meets databases
Online workflow management and performance analysis with stampede

Proceedings of the 7th International Conference on Network and Services Management
Kineograph: taking the pulse of a fast-changing and connected world

Proceedings of the 7th ACM european conference on Computer Systems
A critique of snapshot isolation

Proceedings of the 7th ACM european conference on Computer Systems
LazyBase: trading freshness for performance in a scalable database

Proceedings of the 7th ACM european conference on Computer Systems
The datacenter needs an operating system

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Large-scale incremental data processing with change propagation

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
TransMR: data-centric programming beyond data parallelism

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
The evolving landscape of data management in the cloud

International Journal of Computational Science and Engineering
Abstract state machines for data-parallel computing

Conceptual Modelling and Its Theoretical Foundations
Clustering and load balancing optimization for redundant content removal

Proceedings of the 21st international conference companion on World Wide Web
iMapReduce: A Distributed Computing Framework for Iterative Computation

Journal of Grid Computing
Shredder: GPU-accelerated incremental storage and computation

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Scalable Join Queries in Cloud Data Stores

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Stormy: an elastic and highly available streaming service in the cloud

Proceedings of the 2012 Joint EDBT/ICDT Workshops
Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Using R for iterative and incremental processing

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Index maintenance for time-travel text search

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Oolong: asynchronous distributed applications made easy

Proceedings of the Asia-Pacific Workshop on Systems
Auto-parallelizing stateful distributed streaming applications

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Oolong: asynchronous distributed applications made easy

APSys'12 Proceedings of the Third ACM SIGOPS Asia-Pacific conference on Systems
Spanner: Google's globally-distributed database

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Facilitating real-time graph mining

Proceedings of the fourth international workshop on Cloud data management
Themis: an I/O-efficient MapReduce

Proceedings of the Third ACM Symposium on Cloud Computing
Data security services, solutions and standards for outsourcing

Computer Standards & Interfaces
Condos and clouds

Communications of the ACM
Condos and Clouds

Queue - Web Security
MapReduce-Based data stream processing over large history data

ICSOC'12 Proceedings of the 10th international conference on Service-Oriented Computing
Streaming big data with self-adjusting computation

DDFP '13 Proceedings of the 2013 workshop on Data driven functional programming
ElasTraS: An elastic, scalable, and self-managing transactional database for the cloud

ACM Transactions on Database Systems (TODS)
Cloud Platform Datastore Support

Journal of Grid Computing
Incremental stream processing using computational conflict-free replicated data types

Proceedings of the 3rd International Workshop on Cloud Data and Platforms
Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Omega: flexible, scalable schedulers for large compute clusters

Proceedings of the 8th ACM European Conference on Computer Systems
Brief announcement: towards a fully-articulated pessimistic distributed transactional memory

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Dynamic memory allocation policies for postings in real-time Twitter search

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
DAX: a widely distributed multitenant storage service for DBMS hosting

Proceedings of the VLDB Endowment
Large-scale computation not at the cost of expressiveness

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Spanner: Google’s Globally Distributed Database

ACM Transactions on Computer Systems (TOCS)
Fast candidate generation for real-time tweet search with bloom filter chains

ACM Transactions on Information Systems (TOIS)
Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions

Journal of Grid Computing
Consolidated cluster systems for data centers in the cloud age: a survey and analysis

Frontiers of Computer Science: Selected Publications from Chinese Universities
Simplifying MapReduce data processing

International Journal of Computational Science and Engineering
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Tango: distributed data structures over a shared log

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Discretized streams: fault-tolerant streaming computation at scale

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Naiad: a timely dataflow system

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Exploring storage class memory with key value stores

Proceedings of the 1st Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads
Network-aware data caching and prefetching for cloud-hosted metadata retrieval

NDM '13 Proceedings of the Third International Workshop on Network-Aware Data Management
CORFU: A distributed shared log

ACM Transactions on Computer Systems (TOCS)
MillWheel: fault-tolerant stream processing at internet scale

Proceedings of the VLDB Endowment
Online, asynchronous schema change in F1

Proceedings of the VLDB Endowment
F1: a distributed SQL database that scales

Proceedings of the VLDB Endowment
Scalable transactions across heterogeneous NoSQL key-value data stores

Proceedings of the VLDB Endowment
Active data: a data-centric approach to data life-cycle management

PDSW '13 Proceedings of the 8th Parallel Data Storage Workshop
Toward a scale-out data-management middleware for low-latency enterprise computing

IBM Journal of Research and Development

Quantified Score

Hi-index	0.02

Abstract

Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency. We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.