Checks and balances: monitoring data quality problems in network traffic databases

Authors:
Flip Korn;S. Muthukrishnan;Yunyue Zhu
Affiliations:
AT&T Labs-Research, NJ;Rutgers University and AT&T Labs-Research, Piscataway, NJ;New York University, New York, NY
Venue:
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Year:
2003

Abstract

Internet Service Providers (ISPs) use real-time data feeds of aggregated traffic in their network to support technical as well as business decisions. A fundamental difficulty with building decision support tools based on aggregated traffic data feeds is data quality. Data quality problems stem from network-specific issues (irregular polling caused by UDP packet drops and delays, topological mislabelings, etc.) and make it difficult to distinguish between artifacts and actual phenomena, rendering data analysis based on such data feeds ineffective. In principle, traditional integrity constraints and triggers may be used to enforce data quality. In practice, data cleaning is done outside the database and is ad hoc. Unfortunately, these approaches are too rigid and limited for the subtle data quality problems arising from network data, where existing problems morph with network dynamics, new problems emerge over time, and poor quality data in a local region may itself indicate an important phenomenon in the underlying network. We need a new approach, both in principle and in practice, to face data quality problems in network traffic databases. We propose a continuous data quality monitoring approach based on probabilistic, approximate constraints (PACs). These are simple, user-specified rule templates with open parameters for tolerance and likelihood. We use statistical techniques to instantiate suitable parameter values from the data, and show how to apply them for monitoring data quality. In principle, our PAC-based approach can be applied to data quality problems in any data feed. We present PAC-Man, the system that manages PACs for the entire aggregate network traffic database in a large ISP, and show that it is very effective in monitoring data quality problems.
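The abstract does not spell out the form of a PAC, so the following is only a minimal sketch of the general idea under stated assumptions, not the PAC-Man implementation. It assumes one hypothetical rule template, "consecutive readings differ by at most eps with likelihood at least delta", fits the tolerance eps from historical data as an empirical quantile, and then checks a new window against the instantiated rule. The function names, the template, and the quantile-based fitting are all illustrative choices.

```python
from math import ceil


def fit_tolerance(history, delta):
    """Return the smallest tolerance eps such that at least a fraction
    `delta` of consecutive differences in `history` are within eps.

    Assumption: the rule template is |x[t] - x[t-1]| <= eps.
    """
    diffs = sorted(abs(b - a) for a, b in zip(history, history[1:]))
    if not diffs:
        raise ValueError("need at least two readings")
    # Index of the empirical delta-quantile (1-based rank k).
    k = max(1, ceil(delta * len(diffs)))
    return diffs[k - 1]


def satisfied_fraction(window, eps):
    """Fraction of consecutive pairs in `window` that stay within eps."""
    diffs = [abs(b - a) for a, b in zip(window, window[1:])]
    if not diffs:
        raise ValueError("need at least two readings")
    return sum(d <= eps for d in diffs) / len(diffs)


def violates(window, eps, delta):
    """True when the window satisfies the rule less often than delta."""
    return satisfied_fraction(window, eps) < delta


# Hypothetical traffic counts from one polled link, with one spike.
history = [100, 102, 101, 103, 102, 104, 103, 150, 104, 105, 104]
eps = fit_tolerance(history, delta=0.8)

steady_window = [100, 101, 103, 102, 103]
erratic_window = [100, 160, 101, 158, 102]
steady_flag = violates(steady_window, eps, delta=0.8)
erratic_flag = violates(erratic_window, eps, delta=0.8)
```

In this sketch a flagged window is only a signal for inspection: as the abstract notes, poor quality data in a local region may itself indicate an important phenomenon in the underlying network, so a monitor would report the violation rather than silently discard the data.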