Elements of information theory
On the learnability of discrete distributions
STOC '94 Proceedings of the twenty-sixth annual ACM symposium on Theory of computing
The space complexity of approximating the frequency moments
Journal of Computer and System Sciences
External memory algorithms
Sampling algorithms: lower bounds and applications
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
An Approximate L1-Difference Algorithm for Massive Data Streams
SIAM Journal on Computing
Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries
Proceedings of the 27th International Conference on Very Large Data Bases
Frequency Estimation of Internet Packet Streams with Limited Space
ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
Testing that distributions are close
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
A divisive information theoretic feature clustering algorithm for text classification
The Journal of Machine Learning Research
Profiling internet backbone traffic: behavior models and applications
Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications
The Complexity of Approximating the Entropy
SIAM Journal on Computing
Entropy Based Worm and Anomaly Detection in Fast IP Networks
WETICE '05 Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise
Streaming and sublinear approximation of entropy and information distances
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
The complexity of massive data set computations
Data streaming algorithms for estimating entropy of network traffic
SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
Approximate quantiles and the order of the stream
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On stationarity in Internet measurements through an information-theoretic lens
ICDEW '05 Proceedings of the 21st International Conference on Data Engineering Workshops
Clustering with Bregman Divergences
The Journal of Machine Learning Research
Detecting anomalies in network traffic using maximum entropy estimation
IMC '05 Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement
Estimating entropy over data streams
ESA '06 Proceedings of the 14th Annual European Symposium on Algorithms
A near-optimal algorithm for computing the entropy of a stream
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Estimating entropy and entropy norm on data streams
STACS '06 Proceedings of the 23rd Annual conference on Theoretical Aspects of Computer Science
Some inequalities for information divergence and related measures of discrimination
IEEE Transactions on Information Theory
Lower bounds for quantile estimation in random-order and multi-pass streaming
ICALP '07 Proceedings of the 34th international conference on Automata, Languages and Programming
Taming big probability distributions
XRDS: Crossroads, The ACM Magazine for Students - Big Data
Testing Symmetric Properties of Distributions
SIAM Journal on Computing
On the power of conditional samples in distribution testing
Proceedings of the 4th conference on Innovations in Theoretical Computer Science
Testing Closeness of Discrete Distributions
Journal of the ACM (JACM)
In many data mining and machine learning problems, the data items that need to be clustered or classified are not arbitrary points in a high-dimensional space, but distributions, that is, points on a high-dimensional simplex. For distributions, the natural measures are not ℓp distances but information-theoretic measures such as the Kullback-Leibler and Hellinger divergences; similarly, quantities such as the entropy of a distribution are more natural than frequency moments. Efficient estimation of these quantities is a key component of algorithms for manipulating distributions. Since the datasets involved are typically massive, such algorithms need sublinear complexity to be feasible in practice. We present a range of sublinear-time algorithms in several oracle models, in which the algorithm accesses the data via an oracle that supports specific queries. In particular, we answer a question posed by Batu et al. on testing whether two distributions are close in an information-theoretic sense, given independent samples. We then present optimal algorithms for estimating various information divergences and entropy under a more powerful oracle, the combined oracle, also considered by Batu et al. Finally, we consider sublinear-space algorithms for these quantities in the data-stream model and, in doing so, explore the relationship between the aforementioned oracle models and the data-stream model, continuing work initiated by Feigenbaum et al. An important additional component of the study is the consideration of data streams that are ordered randomly rather than adversarially.
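For reference, the quantities named in the abstract have simple closed forms on explicit probability vectors. The sketch below is not taken from the paper and is purely illustrative: it computes entropy, Kullback-Leibler divergence, and Hellinger distance naively in linear time over the full distribution, whereas the paper's contribution is estimating such quantities in sublinear time or space.

import math

def entropy(p):
    # Shannon entropy H(p) = -sum_i p_i log2 p_i, with 0 log 0 taken as 0.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    # Kullback-Leibler divergence D(p || q); assumes q_i > 0 wherever p_i > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger(p, q):
    # Hellinger distance: (1/sqrt(2)) * Euclidean distance between sqrt(p) and sqrt(q).
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)) / 2)

if __name__ == "__main__":
    # Two small example distributions on a 3-element domain.
    p = [0.5, 0.25, 0.25]
    q = [0.4, 0.4, 0.2]
    print(entropy(p), kl_divergence(p, q), hellinger(p, q))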