Exploiting word cluster information for unsupervised feature selection

Authors:
Qingyao Wu; Yunming Ye; Michael Ng; Hanjing Su; Joshua Huang
Affiliations:
Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China; Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China; Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong; Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
Venue:
PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence
Year:
2010

Abstract

This paper presents an approach for integrating word clustering information into unsupervised feature selection. In our scheme, the words in the whole feature space are clustered into groups based on their co-occurrence statistics. The resulting word cluster information and the bag-of-words information are combined to measure the goodness of each word, which is our basic metric for selecting discriminative features. By exploiting word cluster information, we extend three well-known unsupervised feature selection methods and propose three new methods. A series of experiments is performed on three benchmark text data sets (20 Newsgroups, Reuters-21578 and CLASSIC3). The experimental results show that the new unsupervised feature selection methods select more discriminative features and, in turn, improve clustering performance.
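
The general idea in the abstract (cluster words by co-occurrence, then score each word using both its own bag-of-words statistics and its cluster's statistics) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not name the three base methods or the combination rule, so everything below is an assumption made for illustration. Document frequency stands in for the base score, words are grouped by a simple single-link rule on the cosine similarity of their document sets, and the names `doc_sets`, `cluster_words`, `score_words`, `alpha` and `threshold` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def doc_sets(docs):
    """Map each word to the set of document indices that contain it."""
    occ = defaultdict(set)
    for i, doc in enumerate(docs):
        for word in doc:
            occ[word].add(i)
    return dict(occ)


def cluster_words(occ, threshold=0.5):
    """Group words whose document sets are similar (single-link, union-find).

    Assumption: cosine similarity of binary document-occurrence vectors is
    used as the co-occurrence statistic; the paper may use something else.
    """
    words = sorted(occ)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the cluster representative (path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            shared = len(occ[a] & occ[b])
            sim = shared / sqrt(len(occ[a]) * len(occ[b]))
            if sim >= threshold:
                parent[find(a)] = find(b)
    return {w: find(w) for w in words}


def score_words(docs, alpha=0.5, threshold=0.5):
    """Blend word-level and cluster-level document frequency.

    Assumption: a linear blend with weight alpha; this combination rule is
    hypothetical and chosen only to make the idea concrete.
    """
    occ = doc_sets(docs)
    cluster_of = cluster_words(occ, threshold)
    cluster_docs = defaultdict(set)
    for word, cluster in cluster_of.items():
        cluster_docs[cluster] |= occ[word]
    n = len(docs)
    return {
        word: alpha * len(occ[word]) / n
        + (1 - alpha) * len(cluster_docs[cluster_of[word]]) / n
        for word in occ
    }


# Toy corpus: each document is a list of tokens.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
    ["rare"],
]
scores = score_words(docs)
top_features = sorted(scores, key=scores.get, reverse=True)[:4]
```

In this toy setup, a word that appears in only one document but belongs to a widely occurring cluster is ranked above an equally infrequent word with no co-occurring neighbours, which is the kind of effect the cluster information is meant to provide.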