Multitype Features Coselection for Web Document Clustering

Authors:
Shen Huang;Zheng Chen;Yong Yu;Wei-Ying Ma
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2006

Abstract

Feature selection has been widely applied in text categorization and clustering. Compared with unsupervised selection, supervised feature selection is, in most cases, more successful at filtering out noise. However, because label information is lacking, clustering can hardly exploit supervised selection. Some studies have proposed solving this problem by using a "pseudoclass." As empirical results show, this method is sensitive to the selection criteria and the data sets. In this paper, we propose a novel feature coselection method for Web document clustering, called Multitype Features Coselection for Clustering (MFCC). MFCC uses intermediate clustering results in one type of feature space to help the selection in other types of feature spaces. Our experiments show that, for most selection criteria, MFCC effectively reduces the noise introduced by the "pseudoclass" and further improves clustering performance.
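
The coselection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the selection criteria, or how results from several spaces are combined, so everything below is an assumption made for illustration. The sketch assumes two feature spaces (hypothetically, content terms `A` and link features `B`), plain k-means as the clusterer, and a between-cluster variance score as a stand-in selection criterion. The names `kmeans`, `score_features`, `top_features`, `project`, and `coselect` are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def kmeans(X: Matrix, k: int, iters: int = 20) -> List[int]:
    """Plain k-means with deterministic farthest-point initialisation."""
    def d2(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [list(X[0])]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(list(max(X, key=lambda p: min(d2(p, c) for c in centers))))
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: d2(p, centers[j])) for p in X]
        for j in range(k):
            members = [p for p, lab in zip(X, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def score_features(X: Matrix, labels: Sequence[int], k: int) -> List[float]:
    """Score each column by the between-cluster variance of its values.

    Assumption: this is a stand-in for a supervised criterion applied to
    pseudoclass labels; it is not a criterion taken from the paper.
    """
    scores = []
    for f in range(len(X[0])):
        col = [row[f] for row in X]
        overall = sum(col) / len(col)
        s = 0.0
        for j in range(k):
            vals = [v for v, lab in zip(col, labels) if lab == j]
            if vals:
                s += len(vals) * (sum(vals) / len(vals) - overall) ** 2
        scores.append(s)
    return scores


def top_features(scores: Sequence[float], m: int) -> List[int]:
    """Indices of the m highest-scoring features, in ascending index order."""
    return sorted(sorted(range(len(scores)), key=lambda f: -scores[f])[:m])


def project(X: Matrix, keep: Sequence[int]) -> Matrix:
    """Keep only the selected columns."""
    return [[row[f] for f in keep] for row in X]


def coselect(A: Matrix, B: Matrix, k: int, m_a: int, m_b: int,
             rounds: int = 3) -> Tuple[List[int], List[int], List[int]]:
    """Alternate: cluster in one space, use the labels to select in the other."""
    keep_a = list(range(len(A[0])))
    keep_b = list(range(len(B[0])))
    labels_b: List[int] = []
    for _ in range(rounds):
        labels_a = kmeans(project(A, keep_a), k)       # pseudoclasses from A
        keep_b = top_features(score_features(B, labels_a, k), m_b)
        labels_b = kmeans(project(B, keep_b), k)       # pseudoclasses from B
        keep_a = top_features(score_features(A, labels_b, k), m_a)
    return keep_a, keep_b, labels_b


# Toy data: six documents, two feature spaces. Column 1 of A and
# column 0 of B are constructed as noise.
A = [[5, 1, 0], [4, 0, 0], [5, 1, 1], [0, 1, 5], [1, 0, 4], [0, 0, 5]]
B = [[1, 3], [0, 4], [1, 3], [0, 0], [1, 0], [1, 1]]
keep_a, keep_b, labels = coselect(A, B, k=2, m_a=2, m_b=1)
print(keep_a, keep_b, labels)
```

On this toy input the noisy column in each space is dropped and the two groups of three documents are separated. A realistic setting would use sparse term vectors and criteria such as information gain or chi-square, which this sketch does not attempt.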