Profiling network performance for multi-tier data center applications

Authors:
Minlan Yu;Albert Greenberg;Dave Maltz;Jennifer Rexford;Lihua Yuan;Srikanth Kandula;Changhoon Kim
Affiliations:
Princeton University;Microsoft;Microsoft;Princeton University;Microsoft;Microsoft;Microsoft
Venue:
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Year:
2011

Abstract

Network performance problems are notoriously tricky to diagnose, and the difficulty is magnified when applications are split into multiple tiers of components spread across thousands of servers in a data center. Problems often arise in the communication between tiers, where the application, the network, or both could be to blame. In this paper, we present SNAP, a scalable network-application profiler that guides developers in identifying and fixing performance problems. SNAP passively collects TCP statistics and socket-call logs with low computation and storage overhead, and correlates them across connections and shared resources (e.g., host, link, switch) to pinpoint the location of the problem (e.g., send buffer mismanagement, TCP/application conflicts, application-generated microbursts, or network congestion). Our one-week deployment of SNAP in a production data center (with over 8,000 servers and over 700 application components) has already helped developers uncover 15 major performance problems in application software, the network stack on the server, and the underlying network.
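
The correlation step described in the abstract can be illustrated with a minimal sketch. This is not SNAP's actual algorithm or data model: the record fields (`retransmits`, `send_buffer_full`), the two-way `classify` rule, and the thresholds `min_conns` and `min_fraction` are all hypothetical, chosen only to show the idea of labeling each connection from its TCP statistics and then flagging shared resources where most connections carry the same label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnSample:
    """One connection's statistics for a polling interval (hypothetical schema)."""
    conn_id: str
    resources: tuple          # shared resources used, e.g. ("host:A", "switch:S1")
    retransmits: int          # retransmitted segments seen in the interval
    send_buffer_full: bool    # sender stalled on a full send buffer


def classify(sample):
    """Assumed, simplified rule: map a sample to a coarse problem class or None."""
    if sample.retransmits > 0:
        return "network"
    if sample.send_buffer_full:
        return "send_buffer"
    return None


def suspect_resources(samples, min_conns=2, min_fraction=0.5):
    """Return {(resource, label): fraction} for resources where at least
    min_fraction of at least min_conns connections share the same label."""
    total = defaultdict(int)
    flagged = defaultdict(lambda: defaultdict(int))
    for s in samples:
        label = classify(s)
        for r in s.resources:
            total[r] += 1
            if label is not None:
                flagged[r][label] += 1

    suspects = {}
    for r, by_label in flagged.items():
        if total[r] < min_conns:
            continue  # too few connections to say anything about this resource
        for label, n in by_label.items():
            fraction = n / total[r]
            if fraction >= min_fraction:
                suspects[(r, label)] = fraction
    return suspects
```

In this sketch, a switch shared by several connections that all show retransmissions is reported as a likely network problem, while a lone connection with a full send buffer points back at one application instead of a shared resource.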