Exploiting in-network processing for big data management

Authors:
Lukas Rupprecht
Affiliations:
Imperial College, London, United Kingdom
Venue:
Proceedings of the 2013 SIGMOD/PODS Ph.D. Symposium
Year:
2013

Citing 23
Cited 0

Gigascope: a stream database for network applications

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Data mining with the SAP NetWeaver BI accelerator

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Bigtable: A Distributed Storage System for Structured Data

ACM Transactions on Computer Systems (TOCS)
Better tree - better fruits: using dominating set trees for MAX queries

Proceedings of the 5th workshop on Data management for sensor networks
A scalable, commodity data center network architecture

Proceedings of the ACM SIGCOMM 2008 conference on Data communication
H-store: a high-performance, distributed main memory transaction processing system

Proceedings of the VLDB Endowment
VL2: a scalable and flexible data center network

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
BCube: a high performance, server-centric network architecture for modular data centers

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
SmartRE: an architecture for coordinated network-wide redundancy elimination

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
The case for RAMClouds: scalable high-performance storage entirely in DRAM

ACM SIGOPS Operating Systems Review
Cassandra: a decentralized structured storage system

ACM SIGOPS Operating Systems Review
Extreme scale with full SQL language support in Microsoft SQL Azure

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Improving the scalability of data center networks with traffic-aware virtual machine placement

INFOCOM'10 Proceedings of the 29th conference on Information communications
SideCar: building programmable datacenter networks without programmable switches

Hotnets-IX Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks
Dremel: interactive analysis of web-scale datasets

Proceedings of the VLDB Endowment
CloudNaaS: a cloud networking platform for enterprise applications

Proceedings of the 2nd ACM Symposium on Cloud Computing
Fast crash recovery in RAMCloud

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
SAP HANA database: data management for modern business applications

ACM SIGMOD Record
Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
NaaS: network-as-a-service in the cloud

Hot-ICE'12 Proceedings of the 2nd USENIX conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services
Camdoop: exploiting in-network aggregation for big data applications

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
RAMCube: exploiting network proximity for RAM-based key-value store

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing

Abstract

Data processing systems face the task of efficiently storing and processing data at petabyte scale, and the volume is set to increase further. To meet this requirement, highly scalable, shared-nothing systems, e.g. Google's Bigtable [6] or Facebook's Cassandra [14], partition data and process it in parallel on distributed nodes in a cluster. This allows data to be handled at scale but introduces new challenges that stem from its distribution. Running queries incurs a high network overhead because data has to be exchanged between cluster nodes, and hence the network becomes a critical part of the system. To avoid the network bottleneck, it is essential for distributed data processing systems (DDPS) to be aware of the network rather than treating it as a black box. We propose in-network processing as a way of achieving network-awareness: it decreases bandwidth usage through custom routing, redundancy elimination, and on-path data reduction, and can thereby increase the query throughput of a DDPS. The challenges of an in-network processing system range from design issues, such as performance and transparency, to integration with query optimisation and deployment in data centres. We formulate these challenges as possible research directions and provide a prototype implementation. Our preliminary results suggest that performing partial data reduction within the network can significantly improve query throughput in a DDPS.
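
To make the idea of on-path data reduction more concrete, the following is a minimal sketch and not the author's prototype. It assumes a count-style aggregate that can be merged associatively, and a hypothetical aggregation tree in which each intermediate node merges the partial results of `fan_in` children before forwarding them. The function names, the `fan_in` parameter, and the sample data are all illustrative assumptions. The sketch compares how many key/value records reach the root with and without merging along the path.

```python
from collections import Counter


def merge(partials):
    """Combine partial (key -> count) aggregates into a single aggregate."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total


def direct_transfer(worker_outputs):
    """Baseline: every worker sends its records straight to the root."""
    records_at_root = sum(len(p) for p in worker_outputs)
    return merge(worker_outputs), records_at_root


def on_path_reduction(worker_outputs, fan_in=2):
    """Hypothetical aggregation tree: intermediate nodes merge the partial
    results of `fan_in` children before forwarding them towards the root."""
    level = list(worker_outputs)
    while len(level) > fan_in:
        level = [merge(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    records_at_root = sum(len(p) for p in level)
    return merge(level), records_at_root


# Illustrative (made-up) partial results from four workers.
workers = [
    Counter({"a": 1, "b": 2}),
    Counter({"a": 3, "c": 1}),
    Counter({"b": 1, "c": 4}),
    Counter({"a": 2, "b": 2}),
]

baseline_result, baseline_records = direct_transfer(workers)
reduced_result, reduced_records = on_path_reduction(workers)
print(baseline_records, reduced_records)
```

Both strategies produce the same final aggregate; the difference lies only in how many records cross the last hop to the root. How large the saving is depends on how many keys the partial results share, which this toy example does not attempt to model realistically.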