Dimensionality reduction of empirical co-occurrence data is a fundamental problem in unsupervised learning. It is also a well-studied problem in statistics, known as the analysis of cross-classified data. One principled approach is to represent the data in low dimension with minimal loss of the (mutual) information contained in the original data. In this paper we introduce an information-theoretic nonlinear method for finding such a maximally informative dimensionality reduction. In contrast to previously introduced clustering-based approaches, here we extract continuous feature functions directly from the co-occurrence matrix. In a sense, we automatically extract functions of the variables that serve as approximate sufficient statistics, capturing the information that a sample of one variable carries about the other. Our method differs from dimensionality reduction methods that are based on a specific, sometimes arbitrary, metric or embedding. It can also be interpreted as generalized, multidimensional, nonlinear regression: rather than fitting a single regression function through two-dimensional data, we extract d regression functions whose expectation values capture the information between the variables. It thus presents a new learning paradigm that unifies aspects of both supervised and unsupervised learning. The resulting dimension reduction can be described by two conjugate d-dimensional differential manifolds that are coupled through Maximum Entropy I-projections. The Riemannian metrics of these manifolds are determined by the observed expectation values of the extracted features. Following this geometric interpretation, we present an iterative information-projection algorithm for finding such features and prove its convergence. The algorithm is similar to the method of "association analysis" in statistics, though the feature-extraction context, as well as the information-theoretic and geometric interpretations, are new. We illustrate the algorithm on various synthetic co-occurrence data sets and then demonstrate it on text categorization and information retrieval, where it proves effective in selecting a small set of features, often improving performance over the original feature set.
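To make the flavor of the iteration concrete, here is a minimal sketch, assuming a simplified alternating conditional-expectation update of the kind used in association analysis. The function name, the random initialization, and the QR re-orthonormalization are our illustrative choices, and the Maximum Entropy I-projection coupling described above is deliberately omitted; this is not the paper's exact algorithm.

```python
# Minimal sketch: alternating conditional expectations over a co-occurrence
# matrix, in the spirit of "association analysis". The full method couples
# phi and psi through Maximum Entropy I-projections, which is omitted here.
import numpy as np

def extract_features(counts, d, n_iter=200, seed=0):
    """Extract d-dimensional feature functions (phi over rows, psi over
    columns) from a nonnegative co-occurrence matrix `counts`.
    Assumes every row and column has at least one nonzero entry."""
    p = counts / counts.sum()                        # empirical joint p(x, y)
    p_y_given_x = p / p.sum(axis=1, keepdims=True)   # row x holds p(y | x)
    p_x_given_y = p / p.sum(axis=0, keepdims=True)   # column y holds p(x | y)

    rng = np.random.default_rng(seed)
    psi = rng.standard_normal((p.shape[1], d))       # random initial psi(y)

    for _ in range(n_iter):
        phi = p_y_given_x @ psi        # phi(x) <- E[psi(Y) | X = x]
        psi = p_x_given_y.T @ phi      # psi(y) <- E[phi(X) | Y = y]
        psi, _ = np.linalg.qr(psi)     # re-orthonormalize so the d feature
                                       # directions do not collapse onto one
    phi = p_y_given_x @ psi
    return phi, psi

# Toy usage on strictly positive synthetic counts:
rng = np.random.default_rng(1)
counts = rng.poisson(2.0, size=(200, 80)) + 1
phi, psi = extract_features(counts, d=3)
```

On such toy data this behaves like a power iteration for the dominant association directions of the conditional-expectation operator; the QR step is a pragmatic stabilizer for the sketch, not part of the paper's derivation.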