A two-stage feature selection method for text categorization

Authors:
Jiana Meng;Hongfei Lin;Yuhai Yu
Affiliations:
College of Science, Dalian Nationalities University, Dalian 116600, China and Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China;Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China;Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China and School of Computer Science & Engineering, Dalian Nationalities University, Dalian 116600, ...
Venue:
Computers & Mathematics with Applications
Year:
2011

Abstract

Feature selection for text categorization is a well-studied problem whose goal is to improve the effectiveness of categorization, the efficiency of computation, or both. Text categorization systems based on traditional term matching represent a document in the vector space model; however, this representation requires a high-dimensional space and does not take into account the semantic relationships between terms, which leads to poor categorization accuracy. Latent semantic indexing can overcome this problem by replacing the individual terms with statistically derived conceptual indices. To improve both the accuracy and the efficiency of categorization, in this paper we propose a two-stage feature selection method. First, we apply a novel feature selection method to reduce the dimension of the term space; then we construct a new semantic space between terms based on latent semantic indexing. In applications to spam database categorization, we find that our two-stage feature selection method performs better.
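As a rough illustration of the two-stage idea described in the abstract, the following Python sketch first ranks terms and keeps the top k, then projects the documents into a low-rank latent semantic space. The abstract does not specify the authors' novel selection criterion, so chi-square scoring is used here purely as a stand-in. The function names, the toy corpus, the binary labels (1 for spam, 0 for legitimate mail), and the power-iteration SVD are likewise assumptions made for this example, not the paper's implementation.

```python
import math
import random


def chi_square_scores(docs, labels):
    """Stage 1 (stand-in criterion): chi-square score of each term for a binary label."""
    n = len(docs)
    n_pos = sum(labels)
    doc_sets = [set(doc) for doc in docs]
    scores = {}
    for term in sorted(set().union(*doc_sets)):
        a = sum(1 for s, y in zip(doc_sets, labels) if y == 1 and term in s)
        b = sum(1 for s, y in zip(doc_sets, labels) if y == 0 and term in s)
        c = n_pos - a          # positive documents without the term
        d = (n - n_pos) - b    # negative documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_terms(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def lsi_document_vectors(docs, terms, rank, iters=100, seed=0):
    """Stage 2: rank-`rank` latent semantic coordinates for each document.

    Uses power iteration with deflation on G = A^T A, where A is the
    term-document count matrix restricted to the selected terms.
    """
    a = [[doc.count(t) for doc in docs] for t in terms]
    n = len(docs)
    g = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[] for _ in range(n)]
    for _ in range(rank):
        v = [rng.random() for _ in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:   # nothing left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue (squared singular value).
        lam = sum(v[i] * sum(g[i][j] * v[j] for j in range(n)) for i in range(n))
        sigma = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i].append(sigma * v[i])
        # Deflate so the next pass finds the next singular direction.
        for i in range(n):
            for j in range(n):
                g[i][j] -= lam * v[i] * v[j]
    return coords


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Hypothetical toy corpus: two spam-like and two legitimate messages.
toy_docs = [
    ["free", "money", "offer", "now"],
    ["free", "offer", "click", "now"],
    ["meeting", "schedule", "project", "now"],
    ["project", "report", "meeting", "now"],
]
toy_labels = [1, 1, 0, 0]

selected = select_terms(toy_docs, toy_labels, k=4)
vectors = lsi_document_vectors(toy_docs, selected, rank=2)
print("selected terms:", selected)
print("similarity of the two spam-like messages:", cosine(vectors[0], vectors[1]))
```

In this toy corpus the term "now" occurs in every message, so its chi-square score is zero and stage 1 discards it before the latent space is built. A real pipeline would likely replace the hand-written power iteration with a library SVD routine and feed the resulting vectors to a classifier.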