G2: a graph processing system for diagnosing distributed systems

Authors:
Zhenyu Guo;Dong Zhou;Haoxiang Lin;Mao Yang;Fan Long;Chaoqiang Deng;Changshu Liu;Lidong Zhou
Affiliations:
Microsoft Research Asia;Tsinghua University;Microsoft Research Asia;Microsoft Research Asia;Tsinghua University;Harbin Institute of Technology;Microsoft Research Asia;Microsoft Research Asia
Venue:
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Year:
2011

Abstract

G2 is a graph processing system for diagnosing distributed systems. It works on execution graphs that model runtime events and their correlations in distributed systems. In G2, a diagnosis process involves a series of queries, expressed in a high-level declarative language that supports both relational and graph-based operators. Each query is compiled into a distributed execution. G2's execution engine supports both parallel relational data processing and iterative graph traversal. Execution graphs in G2 tend to have long paths and are structurally distinct from other large-scale graphs, such as social or web graphs. Tailored for execution graphs and graph traversal operations on those graphs, G2's graph engine distinguishes itself by embracing batched asynchronous iterations that allow for better parallelism without barriers, and by enabling partition-level states and aggregation. We have applied G2 to the diagnosis of distributed systems such as Berkeley DB, SCOPE/Dryad, and G2 itself to validate its effectiveness. When co-deployed on a 60-machine cluster, G2's execution engine can handle execution graphs with millions of vertices and edges; for instance, using a query in G2, we traverse, filter, and summarize a 130 million-vertex graph into a 12 thousand-vertex graph within 268 seconds on 60 machines. The use of an asynchronous model and a partition-level interface delivered a 66% reduction in response time when applied to queries in our diagnosis tasks.
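
To make the idea of batched asynchronous traversal with partition-level state more concrete, here is a minimal single-process sketch in Python. It is not G2's engine or query language: the function name `traverse`, the `partition_of` mapping, and the `batch_size` parameter are all hypothetical, and the distributed execution is only simulated with in-memory queues. Each partition keeps its own visited set (partition-level state), processes its pending vertices in small batches whenever it has work, and never waits at a global barrier between iterations.

```python
from collections import defaultdict, deque


def traverse(edges, partition_of, start, batch_size=2):
    """Find vertices reachable from `start`; return (reached, per-partition counts).

    Illustrative only: names and structure are assumptions, not G2's API.
    """
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    inbox = defaultdict(deque)   # partition id -> vertices waiting to be processed
    visited = defaultdict(set)   # partition-level state: vertices seen per partition
    ready = deque()              # partitions that currently have pending work
    scheduled = set()            # partitions already present in `ready`

    def schedule(pid):
        # Queue a partition for processing if it has work and is not queued yet.
        if pid not in scheduled and inbox[pid]:
            scheduled.add(pid)
            ready.append(pid)

    inbox[partition_of[start]].append(start)
    schedule(partition_of[start])

    while ready:
        pid = ready.popleft()
        scheduled.discard(pid)
        # Process one batch for this partition only; other partitions are not
        # forced to finish the same "round" first (no global barrier).
        for _ in range(min(batch_size, len(inbox[pid]))):
            vertex = inbox[pid].popleft()
            if vertex in visited[pid]:
                continue
            visited[pid].add(vertex)
            for neighbor in adjacency[vertex]:
                target = partition_of[neighbor]
                inbox[target].append(neighbor)  # the target partition deduplicates
                schedule(target)
        schedule(pid)  # re-queue this partition if work remains

    # Partition-level aggregation: each partition reports one summary value.
    counts = {pid: len(vertices) for pid, vertices in visited.items()}
    reached = set().union(*visited.values())
    return reached, counts


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
    partition_of = {"a": 0, "b": 1, "c": 0, "d": 1, "x": 0, "y": 1}
    print(traverse(edges, partition_of, "a"))
```

In this toy run the long path a, b, c, d bounces between two partitions, which mirrors the abstract's observation that execution graphs tend to have long paths; a barrier-synchronized model would need one global round per hop, whereas here a partition proceeds as soon as it receives work.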