The "DGX" distribution for mining massive, skewed data

  • Authors:
  • Zhiqiang Bi;Christos Faloutsos;Flip Korn

  • Affiliations:
  • Carnegie Mellon University;Carnegie Mellon University;AT&T Labs-Research

  • Venue:
  • Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2001

Quantified Score

Hi-index 0.00

Visualization

Abstract

Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), to achieve excellent fits in a wide variety of settings; our new distribution includes the Zipf distribution as a special case. We present a statistically sound method for estimating the DGX parameters based on maximum likelihood estimation (MLE). We applied DGX to a wide variety of real world data sets, such as sales data from a large retailer chain, us-age data from AT&T, and Internet clickstream data; in all cases, DGX fits these distributions very well, with almost a 99% correlation coefficient in quantile-quantile plots. Our algorithm also scales very well because it requires only a single pass over the data. Finally, we illustrate the power of DGX as a new tool for data mining tasks, such as outlier detection.