Filtering Methods for Feature Selection in Web-Document Clustering

Authors:
Heum Park; Hyuk-Chul Kwon
Affiliations:
AI Lab. Dept. of Computer Science, Pusan National University, Busan, Korea; AI Lab. Dept. of Computer Science, Pusan National University, Busan, Korea
Venue:
ICCS '07 Proceedings of the 7th international conference on Computational Science, Part II
Year:
2007

Abstract

This paper presents the results of a comparative study of filtering methods for feature selection in web-document clustering. First, we focused on feature selection methods based on Mutual Information (MI) and Information Gain (IG). Using the MI and IG values of the features, we extracted from the documents the representative max-value features, along with a representative cluster for each feature and a representative cluster for each document. Second, we tested the Max Feature Selection Method (MFSM) with those representative features and clusters, and evaluated the web-document clustering performance. However, for document sets that cluster poorly by term frequency, the MFSM cannot obtain good features from the MI and IG values. We therefore propose two new filtering methods: Min Count of Representative Cluster for a Feature (MCRCF) and Min Count of Representative Cluster for a Document (MCRCD). In the experiments, the MFSM performed better than clustering that used only term frequency, MI, and IG. When we applied the new filtering methods (MCRCF and MCRCD) for feature selection, clustering performance improved notably. We therefore conclude that these filtering methods are an effective means of feature selection and perform well in web-document clustering.
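As a rough illustration of the MI-based scoring the abstract describes, the Python sketch below scores each term against each cluster and then picks a representative cluster per feature and a set of max-value features. The abstract does not give the formulas, so this assumes the document-frequency estimate of pointwise mutual information commonly used in text categorization, and it assumes that the "representative cluster for a feature" is the cluster with the highest MI score for that feature. All function names and the toy data are hypothetical. This is not the authors' implementation, and MCRCF and MCRCD are not shown because the abstract does not define them.

```python
import math
from collections import Counter, defaultdict


def mutual_information(docs, labels):
    """Estimate pointwise MI(t, c) = log(P(t, c) / (P(t) * P(c))).

    Probabilities are estimated from document counts (an assumption;
    the abstract does not state which estimator the paper uses).
    Only (term, cluster) pairs that co-occur at least once are scored.
    """
    n = len(docs)
    cluster_count = Counter(labels)      # documents per cluster
    term_count = Counter()               # documents containing each term
    joint_count = defaultdict(Counter)   # joint_count[term][cluster]
    for tokens, cluster in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            joint_count[term][cluster] += 1

    mi = defaultdict(dict)
    for term, per_cluster in joint_count.items():
        for cluster, a in per_cluster.items():
            mi[term][cluster] = math.log(
                (a * n) / (term_count[term] * cluster_count[cluster])
            )
    return mi


def representative_cluster(mi, term):
    """Assumed definition: the cluster with the highest MI for this term."""
    return max(mi[term], key=mi[term].get)


def max_value_features(mi, k):
    """Rank terms by their maximum MI over clusters and keep the top k.

    Ties are broken alphabetically so the result is deterministic.
    """
    best = {term: max(scores.values()) for term, scores in mi.items()}
    return sorted(best, key=lambda term: (-best[term], term))[:k]


# Toy corpus: tokenized documents with hypothetical cluster labels.
docs = [
    ["python", "code", "web"],
    ["python", "code"],
    ["soccer", "goal", "web"],
    ["soccer", "goal"],
]
labels = ["tech", "tech", "sport", "sport"]

mi = mutual_information(docs, labels)
top_features = max_value_features(mi, 4)
```

In this toy corpus the term "web" appears equally in both clusters, so its MI score is zero and it drops out of the top four features, while cluster-specific terms such as "python" and "soccer" are kept.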