Juggling the Jigsaw: towards automated problem inference from network trouble tickets

Authors:
Rahul Potharaju;Navendu Jain;Cristina Nita-Rotaru
Affiliations:
Purdue University;Microsoft Research;Purdue University
Venue:
NSDI '13: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation
Year:
2013

Abstract

This paper presents NetSieve, a system for automated problem inference from network trouble tickets. Network trouble tickets are diaries comprising fixed fields and free-form text that operators write to document the steps taken while troubleshooting a problem. Although tickets carry valuable information for network management, analyzing them for problem inference is extremely difficult: fixed fields are often inaccurate or incomplete, and the free-form text is mostly written in natural language. This paper takes a practical step towards automatically analyzing natural language text in network tickets to infer the problem symptoms, troubleshooting activities, and resolution actions. NetSieve combines statistical natural language processing (NLP), knowledge representation, and ontology modeling to achieve these goals. To cope with ambiguity in free-form text, NetSieve learns from human guidance to improve its inference accuracy. We evaluate NetSieve on 10K+ tickets from a large cloud provider and validate its accuracy using (a) an expert review, (b) a study with operators, and (c) vendor data that tracks device replacement and repairs. Our results show that NetSieve achieves 89%-100% accuracy and that its inference output is useful for learning global problem trends. We have used NetSieve in several key network operations: analyzing device failure trends, understanding why network redundancy fails, and identifying device problem symptoms.
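
To make the inference task concrete, the following is a minimal sketch of tagging free-form ticket text with the three kinds of output the abstract names: problem symptoms, troubleshooting activities, and resolution actions. It is not NetSieve's implementation. The phrase dictionary, the class labels, and the function name `infer_ticket` are all hypothetical, and plain longest-phrase matching stands in for the statistical NLP and ontology modeling that the paper describes.

```python
import re
from collections import defaultdict

# Hypothetical phrase-to-class dictionary, invented for illustration only.
# A real system would derive such phrases from the ticket corpus.
PHRASE_CLASSES = {
    "link down": "problem",
    "packet loss": "problem",
    "checked cable": "activity",
    "ran diagnostics": "activity",
    "replaced power supply": "action",
    "rebooted": "action",
}


def infer_ticket(ticket_text):
    """Return {class: [phrases]} for dictionary phrases found in the text."""
    # Normalize case and whitespace so multi-word phrases match reliably.
    remaining = re.sub(r"\s+", " ", ticket_text.lower())
    found = defaultdict(list)
    # Try longer phrases first so they win over any shorter overlapping ones.
    for phrase in sorted(PHRASE_CLASSES, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, remaining):
            found[PHRASE_CLASSES[phrase]].append(phrase)
            # Blank out matched text so it is not counted again.
            remaining = re.sub(pattern, " ", remaining)
    return {cls: sorted(phrases) for cls, phrases in found.items()}


if __name__ == "__main__":
    ticket = ("Link down on port 3. Checked cable, ran diagnostics, "
              "then replaced power supply.")
    print(infer_ticket(ticket))
```

A fixed dictionary like this cannot resolve the ambiguity of free-form text, which is the gap the abstract says NetSieve addresses by learning from human guidance.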