Filtering Methods for Feature Selection in Web-Document Clustering

  • Authors: Heum Park; Hyuk-Chul Kwon

  • Affiliation: AI Lab., Dept. of Computer Science, Pusan National University, Busan, Korea (both authors)

  • Venue: ICCS '07: Proceedings of the 7th International Conference on Computational Science, Part II
  • Year: 2007

Abstract

This paper presents a comparative study of filtering methods for feature selection in web-document clustering. First, we focus on feature selection methods based on Mutual Information (MI) and Information Gain (IG). Using the MI and IG values of the features, we extract from the documents the representative max-value features, a representative cluster for each feature, and a representative cluster for each document. Second, we test the Max Feature Selection Method (MFSM) with those representative features and clusters and evaluate the resulting web-document clustering performance. However, for document sets on which clustering by term frequency performs poorly, the MFSM cannot obtain good features from the MI and IG values. We therefore propose two new filtering methods: Min Count of Representative Cluster for a Feature (MCRCF) and Min Count of Representative Cluster for a Document (MCRCD). In the experiments, the MFSM performed better than using only term frequency, MI, and IG, and applying the new filtering methods (MCRCF and MCRCD) improved clustering performance notably. These results indicate that the filtering methods are an effective means of feature selection and offer good performance in web-document clustering.
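The abstract does not give the formulas, so the following is only a minimal sketch of one common way to score term-cluster association and pick a "representative cluster for a feature". It assumes the standard pointwise MI estimate from document counts, log(N * df(t, c) / (df(t) * |c|)), and takes the representative cluster to be the one with the maximum score. The function names `mi_scores` and `representative_cluster` and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def mi_scores(docs, labels):
    """Pointwise MI for every (term, cluster) pair that co-occurs.

    docs   -- list of token lists, one per document
    labels -- cluster label of each document (assumed already assigned)
    """
    n = len(docs)
    cluster_size = Counter(labels)   # number of documents per cluster
    term_df = Counter()              # documents containing the term
    joint_df = Counter()             # documents in cluster c containing the term
    for terms, label in zip(docs, labels):
        for t in set(terms):         # count each term once per document
            term_df[t] += 1
            joint_df[(t, label)] += 1
    return {
        (t, c): math.log((a * n) / (term_df[t] * cluster_size[c]))
        for (t, c), a in joint_df.items()
    }

def representative_cluster(scores):
    """Map each term to its (cluster, score) with the highest score."""
    best = {}
    for (t, c), s in scores.items():
        if t not in best or s > best[t][1]:
            best[t] = (c, s)
    return best

# Toy data: four short documents in two clusters (illustrative only).
docs = [["apple", "fruit"], ["apple", "pie"], ["car", "engine"], ["car", "fruit"]]
labels = ["food", "food", "auto", "auto"]

scores = mi_scores(docs, labels)
best = representative_cluster(scores)
print(best["apple"])  # cluster "food", score log(2)
```

In this toy data, "fruit" appears equally often in both clusters, so its score is 0 for each and it carries no cluster information. A score such as IG could be substituted for MI in the same structure; how the paper's MFSM, MCRCF, and MCRCD filters use these values is not specified in the abstract and is not modeled here.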