Categorization of large text collections: feature selection for training neural networks

Authors:
Pensiri Manomaisupat;Bogdan Vrusias;Khurshid Ahmad
Affiliations:
Department of Computing, University of Surrey, Guildford, Surrey, UK;Department of Computing, University of Surrey, Guildford, Surrey, UK;Department of Computer Science, O’Reilly Institute, Trinity College, Dublin 2, Ireland
Venue:
IDEAL'06 Proceedings of the 7th international conference on Intelligent Data Engineering and Automated Learning
Year:
2006

Abstract

Automatic text categorization requires the construction of appropriate surrogates for the documents in a text collection. The surrogates, often called document vectors, are used to train learning systems for categorising unseen documents. A comparison of different measures (tfidf and weirdness) for creating document vectors is presented together with two different state-of-the-art classifiers: Kohonen’s unsupervised SOFM and Vapnik’s supervised SVM. The methods are tested using two ‘gold standard’ document collections and one data set from a ‘real-world’ news stream. There appears to be an optimal size both for the number of document vectors and for the dimensionality of each vector that gives the best compromise between categorization accuracy and training time. The performance of each classifier was computed for five different surrogate vector models: the first two surrogates were created with the tfidf and weirdness measures respectively, the third was created purely on the basis of high-frequency words in the training corpus, and the fourth was created from a standardised terminology database. Finally, the fifth surrogate (used for evaluation purposes) was based on a random selection of words from the training corpus.
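
The abstract names the tfidf and weirdness measures without defining them. The following is a minimal sketch of how such term scores might be used to pick the dimensions of a document vector; it is not the authors' implementation. It assumes that tfidf is summed over the collection as term frequency times log inverse document frequency, and that weirdness is the ratio of a term's relative frequency in the specialist corpus to its relative frequency in a general reference corpus. The add-one smoothing and all function names (`tfidf_scores`, `weirdness_scores`, `top_terms`, `document_vector`) are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Score each term by its tf * log(N / df), summed over all documents.

    docs is a list of token lists. Summing over the collection is an
    assumption made here to obtain one score per term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            scores[term] += count * math.log(n_docs / doc_freq[term])
    return dict(scores)


def weirdness_scores(special_tokens, general_tokens):
    """Ratio of relative frequency in a specialist corpus to that in a general one.

    Add-one smoothing on the general-corpus count is an assumption, used
    only to avoid division by zero for terms absent from that corpus.
    """
    special = Counter(special_tokens)
    general = Counter(general_tokens)
    n_special = len(special_tokens)
    n_general = len(general_tokens)
    return {
        term: (count / n_special) / ((general[term] + 1) / n_general)
        for term, count in special.items()
    }


def top_terms(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def document_vector(doc, vocabulary):
    """Represent a tokenised document as term counts over the chosen vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


# Toy usage: select a 2-term vocabulary by tfidf, then build one vector.
docs = [["bank", "rate", "the"], ["bank", "loan", "the"], ["the", "match"]]
vocabulary = top_terms(tfidf_scores(docs), k=2)
vector = document_vector(docs[1], vocabulary)
```

In this sketch the parameter `k` plays the role of the vector dimensionality that the abstract reports as affecting the trade-off between categorization accuracy and training time.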