Challenges to Error Diagnosis in Hadoop Ecosystems

  • Authors:
  • Jim Li, Siyuan He, Liming Zhu, Xiwei Xu, Min Fu, Len Bass, Anna Liu, An Binh Tran

  • Affiliations:
  • Jim Li: NICTA, Sydney, Australia
  • Siyuan He: Citibank, Toronto, Canada
  • Liming Zhu: NICTA, Sydney, Australia and School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  • Xiwei Xu: NICTA, Sydney, Australia
  • Min Fu: School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  • Len Bass: NICTA, Sydney, Australia and School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  • Anna Liu: NICTA, Sydney, Australia and School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
  • An Binh Tran: School of Computer Science and Engineering, University of New South Wales, Sydney, Australia

  • Venue:
  • LISA'13: Proceedings of the 27th International Conference on Large Installation System Administration
  • Year:
  • 2013

Abstract

Deploying a large-scale distributed ecosystem such as HBase/Hadoop in the cloud is complicated and error-prone. Multiple layers of largely independently evolving software are deployed across distributed nodes on third-party infrastructure. In addition to software incompatibilities and typical misconfigurations within each layer, many subtle and hard-to-diagnose errors arise from misconfigurations across layers and nodes. These errors are difficult to diagnose because of scattered log management and a lack of ecosystem awareness in many diagnosis tools and processes. We report on failure experiences in a real-world HBase/Hadoop deployment and propose initial ideas for better troubleshooting during deployment. We identify the following types of subtle errors and the corresponding troubleshooting challenges: 1) dealing with inconsistency among distributed logs, 2) distinguishing useful information from noisy logging, and 3) probabilistic determination of root causes.
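
To make challenge 1 (and, crudely, challenge 2) concrete, the sketch below is one minimal way to merge per-node Hadoop/HBase daemon logs into a single clock-skew-corrected timeline and surface only WARN-and-above events. It is not the authors' tool: the hostnames, file paths, and skew offsets are hypothetical, and the timestamp pattern assumes the default Log4j layout used by Hadoop/HBase daemons.

# Minimal sketch (not from the paper): merge per-node logs into one timeline,
# compensating for estimated clock skew, then filter to WARN/ERROR/FATAL lines.
import re
from datetime import datetime, timedelta
from pathlib import Path

# Typical Log4j prefix emitted by Hadoop/HBase daemons, e.g.
# "2013-06-14 10:32:07,123 INFO org.apache.hadoop.hdfs...: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?P<level>\w+) (?P<rest>.*)$"
)

# Hypothetical per-node clock-skew estimates (seconds), e.g. from an NTP audit.
CLOCK_SKEW = {
    "master-1": 0.0,
    "region-1": -2.3,
    "region-2": 1.7,
}

def parse_log(path, node):
    """Yield (skew-corrected time, node, level, message) tuples from one log file."""
    skew = timedelta(seconds=CLOCK_SKEW.get(node, 0.0))
    with open(path, errors="replace") as f:
        for line in f:
            m = LINE_RE.match(line)
            if not m:
                continue  # skip stack-trace continuation lines in this sketch
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts - skew, node, m.group("level"), m.group("rest").rstrip()

def merged_timeline(log_files):
    """Merge {node: path} into one event list ordered by corrected timestamp."""
    events = []
    for node, path in log_files.items():
        events.extend(parse_log(path, node))
    return sorted(events, key=lambda e: e[0])

if __name__ == "__main__":
    # Hypothetical layout: one Log4j file per node, already collected locally.
    logs = {
        "master-1": Path("logs/master-1/hbase-master.log"),
        "region-1": Path("logs/region-1/hbase-regionserver.log"),
        "region-2": Path("logs/region-2/hbase-regionserver.log"),
    }
    for when, node, level, msg in merged_timeline(logs):
        if level in ("WARN", "ERROR", "FATAL"):  # crude noise filter (challenge 2)
            print(f"{when.isoformat()} {node:9s} {level:5s} {msg}")

Even this toy version shows why the problem is hard: skew estimates are approximate, log formats differ across layers, and a severity filter alone cannot separate symptoms from root causes, which is where the paper's third challenge (probabilistic root-cause determination) comes in.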