Dissimilarity based feature selection for text classification: a cluster based approach

  • Authors:
  • S. Manjunath; B. S. Harish; D. S. Guru

  • Affiliations:
  • University of Mysore, Manasagangotri, Mysore, India; University of Mysore, Manasagangotri, Mysore, India; University of Mysore, Manasagangotri, Mysore, India

  • Venue:
  • Proceedings of the International Conference & Workshop on Emerging Trends in Technology
  • Year:
  • 2011

Abstract

In this paper, a simple and efficient symbolic text classification method is presented. We propose a new method of representing documents based on clustering of term frequency vectors. For each class of documents, we create multiple clusters to preserve intraclass variations. The term frequency vectors of each cluster are then used to form a symbolic representation using interval-valued features. Subsequently, a new feature selection method based on a new dissimilarity measure is presented. This method reduces the number of features in the representation phase, keeping the best features for effective representation while reducing the time taken to classify a given document. To corroborate the efficacy of the proposed model, we conducted experiments on various datasets. The experimental results reveal that the proposed method gives better results than state-of-the-art techniques. In addition, because the method is based on a simple matching scheme, it requires negligible time.
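
The representation and matching steps outlined in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: each cluster is summarised by per-term [min, max] intervals, the matching scheme counts how many term frequencies of a test document fall inside a cluster's intervals, and cluster membership is already given. The names interval_representation, match_score, and classify are hypothetical. The paper's dissimilarity measure and feature selection step are not specified in the abstract and are not reproduced here.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Interval = Tuple[float, float]


def interval_representation(cluster: List[Vector]) -> List[Interval]:
    """Summarise one cluster of term frequency vectors as per-term intervals.

    Assumption: the interval for a term is [min, max] over the cluster.
    """
    return [(min(column), max(column)) for column in zip(*cluster)]


def match_score(doc: Vector, intervals: List[Interval]) -> int:
    """Count the terms whose frequency lies inside the cluster's interval."""
    return sum(1 for x, (low, high) in zip(doc, intervals) if low <= x <= high)


def classify(doc: Vector, references: Dict[str, List[List[Interval]]]) -> str:
    """Return the label of the class owning the best-matching cluster."""
    best_label, best_score = "", -1
    for label, clusters in references.items():
        for intervals in clusters:
            score = match_score(doc, intervals)
            if score > best_score:
                best_label, best_score = label, score
    return best_label


# Toy data: two classes, clusters assumed to be given (hypothetical values).
training_clusters = {
    "sports": [[[3, 0, 1], [4, 1, 0]], [[0, 0, 5], [1, 0, 6]]],
    "politics": [[[0, 4, 0], [1, 5, 1]]],
}

# One interval-valued reference vector per cluster, several per class.
references = {
    label: [interval_representation(cluster) for cluster in clusters]
    for label, clusters in training_clusters.items()
}

print(classify([3, 1, 1], references))
```

Keeping several interval vectors per class, rather than one, is what lets the representation preserve intraclass variation: a test document only needs to match one of the class's clusters well.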