Finding uninformative features in binary data

Authors:
Xin Wang;Ata Kabán
Affiliations:
School of Computer Science, The University of Birmingham, Birmingham, UK;School of Computer Science, The University of Birmingham, Birmingham, UK
Venue:
IDEAL'05 Proceedings of the 6th international conference on Intelligent Data Engineering and Automated Learning
Year:
2005

Abstract

For statistical modelling of multivariate binary data, such as text documents, data instances are typically represented as vectors over a global vocabulary of attributes. Apart from the issue of high dimensionality, this representation raises the problem that the presence or absence of different attributes is of uneven importance. This problem has been largely overlooked in the literature; however, it may create difficulties in obtaining reliable estimates of unsupervised probabilistic representation models. In turn, automated feature selection and feature weighting in the context of unsupervised learning is challenging, because there is no known target to guide the search. In this paper we propose and study a relatively simple cluster-based generative model for multivariate binary data, equipped with an automated feature weighting capability. Empirical results on both synthetic and real data sets are given and discussed.
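
As a small illustration of the representation the abstract describes, the sketch below encodes a few toy documents as binary presence/absence vectors over a shared vocabulary. It is not the model proposed in the paper. The toy documents, the whitespace tokenisation, and the helper names `build_vocabulary` and `to_binary_vectors` are assumptions made purely for this example.

```python
def build_vocabulary(documents):
    """Collect every distinct token across all documents, in sorted order."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def to_binary_vectors(documents, vocabulary):
    """Encode each document as a 0/1 vector marking which vocabulary terms occur in it."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token in doc.lower().split():
            position = index.get(token)
            if position is not None:  # ignore tokens outside the vocabulary
                row[position] = 1
        vectors.append(row)
    return vectors


# Hypothetical toy corpus, used only for illustration.
documents = ["the cat sat", "the dog sat", "the cat ran"]
vocabulary = build_vocabulary(documents)
vectors = to_binary_vectors(documents, vocabulary)

# Fraction of documents in which each attribute is present.
presence_rate = {
    term: sum(row[i] for row in vectors) / len(vectors)
    for i, term in enumerate(vocabulary)
}
```

In this toy corpus the attribute `the` is present in every document, so its presence does not help to tell the documents apart. This is one simple sense in which attributes may be of uneven importance. How the paper's model actually weights features is not specified in the abstract.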