Challenges to error diagnosis in Hadoop ecosystems

Authors:
Jim Li; Siyuan He; Liming Zhu; Xiwei Xu; Min Fu; Len Bass; Anna Liu; An Binh Tran
Affiliations:
NICTA, Sydney, Australia; Citibank, Toronto, Canada; NICTA, Sydney, Australia and School of Computer Science and Engineering, University of New South Wales, Sydney, Australia; NICTA, Sydney, Australia; School of Computer Science and Engineering, University of New South Wales, Sydney, Australia; NICTA, Sydney, Australia and School of Computer Science and Engineering, University of New South Wales, Sydney, Australia; NICTA, Sydney, Australia and School of Computer Science and Engineering, University of New South Wales, Sydney, Australia; School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Venue:
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
Year:
2013

Abstract

Deploying a large-scale distributed ecosystem such as HBase/Hadoop in the cloud is complicated and error-prone. Multiple layers of largely independently evolving software are deployed across distributed nodes on third-party infrastructure. In addition to software incompatibility and typical misconfiguration within each layer, many subtle and hard-to-diagnose errors arise from misconfigurations across layers and nodes. These errors are difficult to diagnose because log management is scattered and many diagnosis tools and processes lack ecosystem awareness. We report on some failure experiences in a real-world deployment of HBase/Hadoop and propose some initial ideas for better troubleshooting during deployment. We identify the following types of subtle errors and the corresponding troubleshooting challenges: 1) dealing with inconsistency among distributed logs, 2) distinguishing useful information from noisy logging, and 3) probabilistic determination of root causes.