For statistical modelling of multivariate binary data, such as text documents, data instances are typically represented as vectors over a global vocabulary of attributes. Apart from high dimensionality, this representation raises the problem that the presence or absence of different attributes is not equally important. This problem has been largely overlooked in the literature, yet it may make it difficult to obtain reliable estimates of unsupervised probabilistic representation models. Automated feature selection and feature weighting in unsupervised learning are in turn challenging, because there is no known target to guide the search. In this paper we propose and study a relatively simple cluster-based generative model for multivariate binary data, equipped with an automated feature weighting capability. Empirical results on both synthetic and real data sets are presented and discussed.
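As an illustration of the representation described above, the following minimal sketch maps a few toy documents to presence/absence vectors over a global vocabulary. The function name `binarize`, the whitespace tokenization, and the example documents are assumptions made for illustration only; they are not taken from the paper and do not reproduce its preprocessing or its model.

```python
from typing import List, Tuple


def binarize(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Map whitespace-tokenized documents to binary presence/absence vectors."""
    # Global vocabulary: every distinct lowercased token, in sorted order.
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    vectors = []
    for doc in docs:
        present = set(doc.lower().split())
        # 1 if the vocabulary term occurs in this document, 0 otherwise.
        vectors.append([1 if term in present else 0 for term in vocab])
    return vocab, vectors


# Hypothetical toy corpus, used only to show the shape of the data.
docs = ["the cat sat", "the dog sat down"]
vocab, X = binarize(docs)
```

In this toy corpus the terms "the" and "sat" are present in every document, so their presence cannot help tell the documents apart, whereas "cat" and "dog" can. This is one simple way to picture the uneven importance of attribute presences and absences that the abstract refers to.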