Filtering log data: Finding the needles in the Haystack

  • Authors:
  • Li Yu;Ziming Zheng;Zhiling Lan;Terry Jones;Jim M. Brandt;Ann C. Gentile

  • Affiliations:
  • Illinois Institute of Technology, Chicago, 60616, USA;Illinois Institute of Technology, Chicago, 60616, USA;Illinois Institute of Technology, Chicago, 60616, USA;Oak Ridge National Laboratory, TN 37831, USA;Sandia National Laboratories, Livermore, CA 94551, USA;Sandia National Laboratories, Livermore, CA 94551, USA

  • Venue:
  • DSN '12 Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • Year:
  • 2012

Abstract

Log data is an invaluable asset for troubleshooting in large-scale systems. Nevertheless, as system scale keeps growing, the volume of such data becomes overwhelming, placing enormous burdens on both data storage and data analysis. To address this problem, we present a 2-dimensional online filtering mechanism that removes redundant and noisy data via feature selection and instance selection. The objective of this work is two-fold: (i) to significantly reduce data volume without losing important information, and (ii) to make data analysis more effective. We evaluate this new filtering mechanism on real environmental data from production supercomputers at Oak Ridge National Laboratory and Sandia National Laboratories. Our preliminary results demonstrate that our method can reduce disk space usage by more than 85%, thereby significantly reducing analysis time. Moreover, it improves failure prediction and diagnosis by more than 20% compared to the conventional predictive approach relying on RAS (Reliability, Availability, and Serviceability) events alone.
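
To make the two filtering dimensions concrete, the following is a minimal illustrative sketch, not the authors' algorithm, which the abstract does not specify. It assumes that feature selection can be approximated by dropping near-constant columns, and instance selection by dropping samples that barely differ from the last sample kept. The function names, the thresholds `min_variance` and `tolerance`, and the sample readings are all hypothetical.

```python
from statistics import pvariance


def select_features(rows, min_variance=1e-6):
    """Return indices of columns whose population variance exceeds min_variance.

    Assumption: a near-constant column carries no useful signal. The paper's
    actual feature-selection criterion is not given in the abstract.
    """
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if pvariance(col) > min_variance]


def select_instances(rows, keep, tolerance=0.5):
    """Single pass over rows: keep a row only if some retained feature differs
    from the last kept row by more than tolerance (a stand-in for redundancy).
    """
    kept = []
    last = None
    for row in rows:
        reduced = [row[i] for i in keep]
        if last is None or any(abs(a - b) > tolerance for a, b in zip(reduced, last)):
            kept.append(reduced)
            last = reduced
    return kept


def filter_log(rows, min_variance=1e-6, tolerance=0.5):
    """Apply column filtering, then row filtering, and return both results."""
    keep = select_features(rows, min_variance)
    return keep, select_instances(rows, keep, tolerance)


# Hypothetical environmental samples: [temperature, fan_state, voltage]
samples = [
    [40.0, 1.0, 12.0],
    [40.1, 1.0, 12.0],
    [40.2, 1.0, 12.0],
    [55.0, 1.0, 12.0],
    [55.1, 1.0, 11.9],
]
keep, filtered = filter_log(samples)
print(keep)      # indices of retained columns
print(filtered)  # retained rows, restricted to those columns
```

In this toy input the constant `fan_state` column is dropped and only the two samples that mark a change in temperature survive. For simplicity the sketch computes variance over a whole batch; a truly online variant would need a running estimate instead.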