Dissimilarity based feature selection for text classification: a cluster based approach

Authors:
S. Manjunath;B. S. Harish;D. S. Guru
Affiliations:
University of Mysore, Manasagangotri, Mysore, India;University of Mysore, Manasagangotri, Mysore, India;University of Mysore, Manasagangotri, Mysore, India
Venue:
Proceedings of the International Conference & Workshop on Emerging Trends in Technology
Year:
2011

Abstract

In this paper, a simple and efficient symbolic text classification method is presented. We propose a new method of representing documents based on clustering of term frequency vectors. For each class of documents, we create multiple clusters to preserve intraclass variations. The term frequency vectors of each cluster are then used to form a symbolic representation using interval-valued features. A new feature selection method based on a new dissimilarity measure is also presented. It reduces the number of features in the representation phase, keeping the best features for effective representation while reducing the time taken to classify a given document. To corroborate the efficacy of the proposed model, we conducted experiments on various datasets. Experimental results reveal that the proposed method gives better results than state-of-the-art techniques. In addition, because the method is based on a simple matching scheme, it requires negligible time.
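The interval-valued representation and matching idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the intervals are formed, which clustering algorithm is used, or what the dissimilarity measure is, so everything concrete below is an assumption made for illustration: intervals are taken as the per-term [min, max] over each cluster, cluster assignments are supplied by hand rather than computed, and the matching scheme simply counts how many term frequencies of a test document fall inside a cluster's intervals. The function names, toy data, and class names are hypothetical, and the feature selection step is not shown.

```python
import numpy as np


def build_interval_representation(tf, labels):
    """Map each cluster id to (lower, upper) per-term bounds.

    Assumption: an interval is the [min, max] of a term's frequency
    over the documents in the cluster. The paper may define it differently.
    """
    intervals = {}
    for c in np.unique(labels):
        members = tf[labels == c]
        intervals[int(c)] = (members.min(axis=0), members.max(axis=0))
    return intervals


def match_score(doc, interval):
    """Count the terms whose frequency lies inside the cluster's interval."""
    lower, upper = interval
    return int(np.sum((doc >= lower) & (doc <= upper)))


def classify(doc, intervals, cluster_to_class):
    """Assign the class of the cluster with the highest match score."""
    best = max(intervals, key=lambda c: match_score(doc, intervals[c]))
    return cluster_to_class[best]


# Toy term frequency vectors (rows: documents, columns: terms).
tf = np.array([
    [5, 0, 1],
    [6, 1, 0],
    [0, 4, 3],
    [1, 5, 4],
], dtype=float)

# Hypothetical cluster assignments; in practice these would come from
# clustering the term frequency vectors of each class separately.
labels = np.array([0, 0, 1, 1])
cluster_to_class = {0: "sports", 1: "politics"}

intervals = build_interval_representation(tf, labels)
predicted = classify(np.array([5.0, 1.0, 1.0]), intervals, cluster_to_class)
print(predicted)
```

Because classification reduces to a few vectorized comparisons per cluster, the cost per document grows with the number of clusters times the number of retained features, which is consistent with the abstract's point that reducing features also reduces classification time.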