The Information Bottleneck is an information-theoretic framework that finds concise representations of an 'input' random variable that are as relevant as possible to an 'output' random variable. The framework has been applied successfully in a variety of supervised and unsupervised settings. However, its learning-theoretic properties and justification have remained unclear, since it differs from standard learning models in several crucial respects, most notably its explicit reliance on the joint input-output distribution. In practice, an empirical plug-in estimate of this distribution has been used, so far without any finite-sample performance guarantees. In this paper we present several formal results that address these difficulties. We prove several finite-sample bounds, which show that the information bottleneck can provide concise representations with good generalization from sample sizes smaller than those needed to estimate the underlying distribution. The bounds are non-uniform and adapt to the complexity of the specific model chosen. Based on these results, we also present a preliminary analysis of the information bottleneck method as a learning algorithm within the familiar performance-complexity tradeoff framework. In addition, we formally describe the connection between the information bottleneck and minimal sufficient statistics.
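To make the "empirical plug-in estimate" concrete: in practice the joint distribution p(x, y) is replaced by the normalized co-occurrence counts from a finite sample, and information quantities are computed from that estimate. The sketch below (an illustration only, not the paper's algorithm; the function name and interface are our own) computes the plug-in estimate of the mutual information I(X; Y) in nats from paired discrete samples — the quantity whose finite-sample behavior the paper's bounds concern.

```python
import numpy as np

def plugin_mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples.

    Replaces the true joint distribution with empirical co-occurrence
    frequencies, then evaluates the mutual information of that estimate.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    # Map each observed symbol to a contiguous integer index.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    # Empirical joint distribution: normalized co-occurrence counts.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    # Empirical marginals.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    # Sum p(x,y) * log( p(x,y) / (p(x)p(y)) ) over nonzero cells.
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())
```

For perfectly correlated binary samples the estimate equals log 2, and for samples whose empirical joint factorizes it is exactly zero; on small samples, however, the plug-in estimate is biased, which is precisely why finite-sample guarantees are nontrivial.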