Using the self organizing map for clustering of text documents

  • Authors:
  • Dino Isa;V. P. Kallimani;Lam Hong Lee

  • Affiliations:
  • Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus, 43500 Semenyih, Malaysia;Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus, 43500 Semenyih, Malaysia;Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus, 43500 Semenyih, Malaysia

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2009

Quantified Score

Hi-index 12.07

Abstract

An increasing number of computational and statistical approaches have been used for text classification, including nearest-neighbor classification, naive Bayes classification, support vector machines, decision tree induction, rule induction, and artificial neural networks. Among these approaches, naive Bayes classifiers have been widely used because of their simplicity. Because the Bayes formula is simple, the naive Bayes classification algorithm requires a relatively small amount of training data and less time in both the training and classification stages than other classifiers. However, a major shortcoming of this technique is that the classifier simply assigns the document to the category with the highest probability. Doing this is tantamount to classifying using only one dimension of a multi-dimensional data set. The main aim of this work is to utilize the strengths of the self-organizing map (SOM) to overcome the inadvertent dimensionality reduction that results from using only the Bayes formula to classify. Combining the hybrid system with new ranking techniques further improves the performance of the proposed document classification approach. This work describes the implementation of an enhanced hybrid classification approach that affords better classification accuracy by combining two familiar algorithms: the naive Bayes classification algorithm, which is used to vectorize the document as a probability distribution over categories, and the self-organizing map (SOM) clustering algorithm, which is used as the multi-dimensional unsupervised classifier.
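
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names, the Laplace smoothing, the one-dimensional SOM grid, and every parameter value (number of units, epochs, learning rate) are assumptions chosen only to keep the example short. Stage one keeps the full naive Bayes posterior vector for each document instead of only the top category; stage two feeds those vectors to a small SOM and reads off the best-matching unit.

```python
import math
import random
from collections import Counter

def train_naive_bayes(docs, labels):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d.split()}
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    return classes, vocab, word_counts, doc_counts

def posterior_vector(doc, model):
    """Return P(class | doc) for every class, not just the most likely one."""
    classes, vocab, word_counts, doc_counts = model
    n_docs = sum(doc_counts.values())
    log_scores = []
    for c in classes:
        total = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / n_docs)  # class prior
        for w in doc.split():
            if w in vocab:  # words unseen in training are ignored
                # Laplace (add-one) smoothing: an assumption of this sketch
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        log_scores.append(score)
    # Normalise in log space so the vector sums to 1
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

def best_matching_unit(weights, x):
    """Index of the SOM unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_som(vectors, n_units=4, epochs=50, lr0=0.5, seed=0):
    """Train a tiny 1-D SOM; grid shape and decay schedule are illustrative."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    radius0 = n_units / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # decaying learning rate
        radius = max(radius0 * (1 - frac), 0.5)  # shrinking neighbourhood
        for x in vectors:
            bmu = best_matching_unit(weights, x)
            for i in range(n_units):
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    weights[i][k] += lr * h * (x[k] - weights[i][k])
    return weights

# Hypothetical toy corpus, for illustration only
docs = ["goal match team", "team win goal",
        "stock market price", "price market trade"]
labels = ["sport", "sport", "finance", "finance"]

model = train_naive_bayes(docs, labels)
vectors = [posterior_vector(d, model) for d in docs]  # stage 1: vectorize
som = train_som(vectors)                              # stage 2: cluster
units = [best_matching_unit(som, v) for v in vectors]
print(units)
```

Because the SOM sees the whole probability vector, two documents with the same top category but different secondary probabilities can still land on different units, which is the information a plain highest-probability decision discards.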