Using the self organizing map for clustering of text documents

Authors:
Dino Isa;V. P. Kallimani;Lam Hong Lee
Affiliations:
Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus, 43500 Semenyih, Malaysia;Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus, 43500 Semenyih, Malaysia;Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus, 43500 Semenyih, Malaysia
Venue:
Expert Systems with Applications: An International Journal
Year:
2009

Quantified Score

Hi-index	12.07

Abstract

An increasing number of computational and statistical approaches have been used for text classification, including nearest-neighbor classification, naive Bayes classification, support vector machines, decision tree induction, rule induction, and artificial neural networks. Among these approaches, naive Bayes classifiers have been widely used because of their simplicity. Because the Bayes formula is simple, the naive Bayes classification algorithm requires a relatively small amount of training data and less time in both the training and classification stages than other classifiers. However, a major shortcoming of this technique is that the classifier assigns the document to the single category with the highest probability. Doing this is tantamount to classifying using only one dimension of a multi-dimensional data set. The main aim of this work is to use the strengths of the self-organizing map (SOM) to overcome the inadvertent dimensionality reduction that results from using only the Bayes formula to classify. Combining the hybrid system with new ranking techniques further improves the performance of the proposed document classification approach. This work describes the implementation of an enhanced hybrid classification approach that affords better classification accuracy by combining two familiar algorithms: the naive Bayes classification algorithm, which is used to vectorize the document using a probability distribution, and the SOM clustering algorithm, which is used as the multi-dimensional unsupervised classifier.
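
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy term counts, the Laplace smoothing constant `alpha`, the 2x2 map size, the decay schedules, and the function names `nb_posteriors`, `train_som`, and `bmu_index` are all assumptions made for illustration. A multinomial naive Bayes model turns each document into a full vector of class posteriors rather than a single top category, and a small SOM is then trained on those vectors.

```python
import numpy as np

def nb_posteriors(X, y, n_classes, alpha=1.0):
    """Fit multinomial naive Bayes; return a function giving posterior vectors."""
    # Log priors and Laplace-smoothed log likelihoods (alpha is an assumed value).
    log_prior = np.log(np.bincount(y, minlength=n_classes) / len(y))
    counts = np.vstack([X[y == c].sum(axis=0) for c in range(n_classes)]) + alpha
    log_like = np.log(counts / counts.sum(axis=1, keepdims=True))

    def vectorize(docs):
        log_post = docs @ log_like.T + log_prior
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)    # one row per document

    return vectorize

def train_som(data, rows=2, cols=2, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, data.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.1      # shrinking neighbourhood radius
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def bmu_index(weights, x):
    """Index of the best-matching unit (closest SOM node) for vector x."""
    return int(np.argmin(((weights - x) ** 2).sum(axis=1)))

# Toy term-count matrix: 4 documents, 4 terms, 2 categories (made-up data).
X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 3, 2], [0, 1, 2, 3]])
y = np.array([0, 0, 1, 1])

vectorize = nb_posteriors(X, y, n_classes=2)
vectors = vectorize(X)                 # each document as a probability distribution
som = train_som(vectors)               # cluster the posterior vectors
nodes = [bmu_index(som, v) for v in vectors]
```

In this sketch the SOM sees every component of the posterior vector, whereas a plain naive Bayes classifier would keep only `vectors.argmax(axis=1)`. How map nodes are turned into final category decisions (for example, by the ranking techniques the abstract mentions) is not specified here and is left out.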