In-situ MapReduce for log processing

Authors:
Dionysios Logothetis;Chris Trezzo;Kevin C. Webb;Kenneth Yocum
Affiliations:
UCSD Department of Computer Science;Salesforce.com, Inc.;UCSD Department of Computer Science;UCSD Department of Computer Science
Venue:
HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Year:
2011

Abstract

Log analytics are a bedrock component of running many of today's Internet sites. Application and click logs form the basis for tracking and analyzing customer behaviors and preferences, and they are the basic inputs to ad-targeting algorithms. Logs are also critical for performance and security monitoring, debugging, and optimizing the large compute infrastructures that make up the compute "cloud": thousands of machines spanning multiple data centers. With current log generation rates on the order of 1-10 MB/s per machine, a single data center can create tens of TBs of log data a day. While bulk data processing has proven to be an essential tool for log processing, current practice transfers all logs to a centralized compute cluster before analysis. This not only consumes large amounts of network and disk bandwidth, but also delays the completion of time-sensitive analytics. We present an in-situ MapReduce architecture that mines data "on location", bypassing the cost and wait time of this store-first-query-later approach. Unlike current approaches, our architecture explicitly supports reduced data fidelity, allowing users to annotate queries with latency and fidelity requirements. This fills an important gap in current bulk processing systems: users can trade potential decreases in data fidelity for improved response times or reduced load on end systems. We report on the design and implementation of our in-situ MapReduce architecture, and illustrate how it improves our ability to accommodate increasing log generation rates.
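
To make the abstract's volume estimate concrete, the following back-of-envelope sketch converts a per-machine log rate into a daily data-center total. It then shows, on toy data, the general idea of aggregating logs on the node that produced them so that only small partial results cross the network. The machine count of 500, the sample log lines, the "<timestamp> <url>" line format, and all function names are illustrative assumptions; none of them are figures or interfaces from the paper.

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_log_volume_tb(rate_mb_per_s: float, machines: int) -> float:
    """Daily log volume in TB, using decimal units (1 TB = 1,000,000 MB)."""
    return rate_mb_per_s * SECONDS_PER_DAY * machines / 1_000_000


def map_and_combine(lines):
    """Hypothetical node-local step: count requests per URL where the log lives.

    Assumes each line looks like "<timestamp> <url>" (an invented format).
    """
    counts = Counter()
    for line in lines:
        _, url = line.split(maxsplit=1)
        counts[url] += 1
    return counts


def reduce_counts(partials):
    """Merge the per-node partial counts into one global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Volume arithmetic for the 1-10 MB/s range quoted in the abstract,
    # with an assumed 500 machines in the data center.
    for rate in (1, 10):
        volume = daily_log_volume_tb(rate, 500)
        print(f"{rate} MB/s x 500 machines -> {volume:.1f} TB/day")

    # Toy logs held on two separate nodes.
    node_a = ["1300000000 /index.html", "1300000001 /cart", "1300000002 /index.html"]
    node_b = ["1300000003 /index.html", "1300000004 /cart"]

    # Each node ships only its small partial count, not its raw log lines.
    partials = [map_and_combine(node_a), map_and_combine(node_b)]
    print(dict(reduce_counts(partials)))
```

At 1 MB/s a single machine produces 86.4 GB of logs per day, so under the assumed 500 machines the total is 43.2 TB per day, in line with the "tens of TBs" quoted above. The sketch does not model the latency and fidelity annotations the abstract describes.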