Mining Deviants in Time Series Data Streams

  • Authors:
  • S. Muthukrishnan;Rahul Shah;Jeffrey Scott Vitter

  • Affiliations:
  • Rutgers University;Purdue University;Purdue University

  • Venue:
  • SSDBM '04 Proceedings of the 16th International Conference on Scientific and Statistical Database Management
  • Year:
  • 2004

Abstract

One of the central tasks in managing, monitoring and mining data streams is that of identifying outliers. There is a long history of study of various outliers in statistics and databases, and a recent focus on mining outliers in data streams. Here, we adopt the notion of "deviants" from Jagadish et al. [Mining Deviants in a Time Series Database] as outliers. Deviants are based on one of the most fundamental statistical concepts, that of standard deviation (or variance). Formally, deviants are defined based on a representation sparsity metric, i.e., deviants are values whose removal from the dataset leads to an improved compressed representation of the remaining items. Thus, deviants are not global maxima/minima; rather, they are appropriate local aberrations. Deviants are known to be of great mining value in time series databases.

We present the first known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in data size) and are able to quickly find deviants at any instant, as the data stream evolves over time. For all versions of this problem (uni- vs multivariate time series, optimal vs near-optimal vs heuristic solutions, offline vs streaming), our algorithms have the same framework of maintaining a hierarchical set of candidate deviants that are updated as the time series data gets progressively revealed. We show experimentally, using real network traffic data (SNMP aggregate time series) as well as synthetic data, that our algorithm is remarkably accurate in determining the deviants.
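
The definition of deviants in the abstract can be made concrete with a small sketch. The Python below is an illustration only, not the paper's streaming algorithm: it assumes the compressed representation is a histogram with fixed, hand-chosen buckets (each bucket stored as its mean), measures quality as the sum of squared errors, and finds the k deviants by brute force over all index subsets. The names `sse`, `representation_error`, `find_deviants`, and the `buckets` parameter are hypothetical and do not come from the paper.

```python
from itertools import combinations


def sse(values):
    """Sum of squared errors when `values` is represented by its mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def representation_error(series, buckets, removed):
    """Total error of a fixed-bucket histogram, ignoring `removed` indices.

    `buckets` is a list of half-open (start, end) index ranges; this fixed
    bucketing is an assumption made for the sketch.
    """
    total = 0.0
    for start, end in buckets:
        kept = [series[i] for i in range(start, end) if i not in removed]
        total += sse(kept)
    return total


def find_deviants(series, buckets, k):
    """Brute-force search for the k indices whose removal minimizes the error."""
    if not 0 <= k <= len(series):
        raise ValueError("k must be between 0 and len(series)")
    best_idx, best_err = None, float("inf")
    for idx in combinations(range(len(series)), k):
        err = representation_error(series, buckets, set(idx))
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err


series = [1, 1, 9, 1, 20, 20, 21, 20]
buckets = [(0, 4), (4, 8)]
deviants, error = find_deviants(series, buckets, k=1)
print(deviants, error)  # (2,) 0.75
```

In this toy series the single deviant is the value 9 at index 2, even though 21 is the global maximum: removing 9 drops the error from 48.75 to 0.75, while removing 21 only drops it to 48. This matches the abstract's point that deviants are local aberrations rather than global extremes. The brute-force search is exponential in k and is meant only to show the definition, whereas the abstract describes algorithms that use polylogarithmic space.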