In-situ MapReduce for log processing

  • Authors:
  • Dionysios Logothetis;Chris Trezzo;Kevin C. Webb;Kenneth Yocum

  • Affiliations:
  • UCSD Department of Computer Science;Salesforce.com, Inc.;UCSD Department of Computer Science;UCSD Department of Computer Science

  • Venue:
  • HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Log analytics are a bedrock component of running many of today's Internet sites. Application and click logs form the basis for tracking and analyzing customer behaviors and preferences, and they form the basic inputs to ad-targeting algorithms. Logs are also critical for performance and security monitoring, debugging, and optimizing the large compute infrastructures that make up the compute "cloud", thousands of machines spanning multiple data centers. With current log generation rates on the order of 1-10 MB/s per machine, a single data center can create tens of TBs of log data a day. While bulk data processing has proven to be an essential tool for log processing, current practice transfers all logs to a centralized compute cluster. This not only consumes large amounts of network and disk bandwidth, but also delays the completion of time-sensitive analytics. We present an in-situ MapReduce architecture that mines data "on location", bypassing the cost and wait time of this store-first-query-later approach. Unlike current approaches, our architecture explicitly supports reduced data fidelity, allowing users to annotate queries with latency and fidelity requirements. This approach fills an important gap in current bulk processing systems, allowing users to trade potential decreases in data fidelity for improved response times or reduced load on end systems. We report on the design and implementation of our in-situ MapReduce architecture, and illustrate how it improves our ability to accommodate increasing log generation rates.