Elements of information theory
On the learnability of discrete distributions
STOC '94 Proceedings of the twenty-sixth annual ACM symposium on Theory of computing
The space complexity of approximating the frequency moments
Journal of Computer and System Sciences
External memory algorithms
Sampling algorithms: lower bounds and applications
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
An Approximate L1-Difference Algorithm for Massive Data Streams
SIAM Journal on Computing
Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries
Proceedings of the 27th International Conference on Very Large Data Bases
Frequency Estimation of Internet Packet Streams with Limited Space
ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
Testing that distributions are close
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
A divisive information theoretic feature clustering algorithm for text classification
The Journal of Machine Learning Research
Profiling internet backbone traffic: behavior models and applications
Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications
The Complexity of Approximating the Entropy
SIAM Journal on Computing
Entropy Based Worm and Anomaly Detection in Fast IP Networks
WETICE '05 Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise
Streaming and sublinear approximation of entropy and information distances
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
The complexity of massive data set computations
Data streaming algorithms for estimating entropy of network traffic
SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Approximate quantiles and the order of the stream
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On stationarity in Internet measurements through an information-theoretic lens
ICDEW '05 Proceedings of the 21st International Conference on Data Engineering Workshops
Clustering with Bregman Divergences
The Journal of Machine Learning Research
Detecting anomalies in network traffic using maximum entropy estimation
IMC '05 Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement
Estimating entropy over data streams
ESA '06 Proceedings of the 14th Annual European Symposium on Algorithms
A near-optimal algorithm for computing the entropy of a stream
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Estimating entropy and entropy norm on data streams
STACS '06 Proceedings of the 23rd Annual conference on Theoretical Aspects of Computer Science
Some inequalities for information divergence and related measures of discrimination
IEEE Transactions on Information Theory
Lower bounds for quantile estimation in random-order and multi-pass streaming
ICALP '07 Proceedings of the 34th international conference on Automata, Languages and Programming
Taming big probability distributions
XRDS: Crossroads, The ACM Magazine for Students - Big Data
Testing Symmetric Properties of Distributions
SIAM Journal on Computing
On the power of conditional samples in distribution testing
Proceedings of the 4th conference on Innovations in Theoretical Computer Science
Testing Closeness of Discrete Distributions
Journal of the ACM (JACM)
In many data mining and machine learning problems, the data items that need to be clustered or classified are not arbitrary points in a high-dimensional space, but distributions, that is, points on a high-dimensional simplex. For distributions, the natural measures are not ℓp distances but information-theoretic measures such as the Kullback-Leibler and Hellinger divergences; similarly, quantities such as the entropy of a distribution are more natural than frequency moments. Efficient estimation of these quantities is a key component of algorithms for manipulating distributions. Since the datasets involved are typically massive, such algorithms need sublinear complexity to be feasible in practice. We present a range of sublinear-time algorithms in several oracle models, in which the algorithm accesses the data via an oracle that supports specific queries. In particular, we answer a question posed by Batu et al. on testing whether two distributions are close in an information-theoretic sense, given independent samples. We then present optimal algorithms for estimating various information divergences and entropy under a more powerful oracle, the combined oracle, also considered by Batu et al. Finally, we consider sublinear-space algorithms for these quantities in the data-stream model and, in doing so, explore the relationship between the aforementioned oracle models and the data-stream model, continuing work initiated by Feigenbaum et al. An important additional component of the study is the consideration of data streams that are ordered randomly rather than adversarially.
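For reference, the quantities named in the abstract have simple closed forms on explicit probability vectors. The sketch below is not taken from the paper and is purely illustrative: it computes entropy, Kullback-Leibler divergence, and Hellinger distance naively in linear time over the full distribution, whereas the paper's contribution is estimating such quantities in sublinear time or space.

import math

def entropy(p):
    # Shannon entropy H(p) = -sum_i p_i log2 p_i, with 0 log 0 taken as 0.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    # Kullback-Leibler divergence D(p || q); assumes q_i > 0 wherever p_i > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger(p, q):
    # Hellinger distance: (1/sqrt(2)) * Euclidean distance between sqrt(p) and sqrt(q).
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)) / 2)

if __name__ == "__main__":
    # Two small example distributions on a 3-element domain.
    p = [0.5, 0.25, 0.25]
    q = [0.4, 0.4, 0.2]
    print(entropy(p), kl_divergence(p, q), hellinger(p, q))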