Cluster based symbolic representation and feature selection for text classification

  • Authors:
  • B. S. Harish; D. S. Guru; S. Manjunath; R. Dinesh

  • Affiliations:
  • Department of Studies in Computer Science, University of Mysore, Mysore, India; Department of Studies in Computer Science, University of Mysore, Mysore, India; Department of Studies in Computer Science, University of Mysore, Mysore, India; Honeywell Technologies Ltd, Bangalore, India

  • Venue:
  • ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications - Volume Part II
  • Year:
  • 2010

Abstract

In this paper, we propose a new method of representing documents based on clustering of term frequency vectors. For each class of documents, we create multiple clusters to preserve intraclass variations. The term frequency vectors of each cluster are used to form a symbolic representation by means of interval-valued features. We then propose a novel symbolic method for feature selection and present the corresponding symbolic text classification. To corroborate the efficacy of the proposed model, we conducted experiments on various datasets. The experimental results show that the proposed method outperforms state-of-the-art techniques. In addition, because the method is based on a simple matching scheme, it requires negligible time.
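The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the intervals are built or how matching is scored, so the choices below are assumptions made only for illustration: each cluster is summarised by a per-term [min, max] interval, a document's score against a cluster is the number of its term frequencies that fall inside the corresponding intervals, and the predicted class is the one that owns the best-matching cluster. The clusters are taken as given (any clustering algorithm could produce them), the feature selection step is omitted, and all function names and toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Interval = Tuple[float, float]


def interval_representation(cluster: List[Vector]) -> List[Interval]:
    """Summarise one cluster as a per-term (min, max) interval (assumed rule)."""
    return [(min(column), max(column)) for column in zip(*cluster)]


def match_score(doc: Vector, intervals: List[Interval]) -> int:
    """Count the terms whose frequency lies inside the cluster's interval."""
    return sum(lo <= x <= hi for x, (lo, hi) in zip(doc, intervals))


def classify(doc: Vector, model: Dict[str, List[List[Interval]]]) -> str:
    """Return the class whose best-matching cluster gives the highest score."""
    return max(
        model,
        key=lambda label: max(match_score(doc, iv) for iv in model[label]),
    )


# Hypothetical term-frequency vectors over three terms, already grouped
# into clusters per class (two clusters for "sports", one for "finance").
clusters = {
    "sports": [[[5, 0, 1], [6, 1, 0]], [[2, 0, 4], [3, 0, 5]]],
    "finance": [[[0, 7, 1], [1, 9, 2]]],
}

# One interval-valued feature vector per cluster.
model = {
    label: [interval_representation(c) for c in class_clusters]
    for label, class_clusters in clusters.items()
}

predicted = classify([5, 1, 1], model)
print(predicted)
```

Keeping several clusters per class, rather than one interval vector for the whole class, is what lets the representation retain intraclass variation: a single [min, max] range over all "sports" documents would be much wider and would match more unrelated documents.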