Some inconsistencies and misnomers in probabilistic information retrieval
SIGIR '91 Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval
Inducing Features of Random Fields
IEEE Transactions on Pattern Analysis and Machine Intelligence
The use of MMR, diversity-based reranking for reordering documents and producing summaries
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
A re-examination of text categorization methods
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
A statistical learning model of text classification for support vector machines
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Using machine learning to improve information access
Positive approximation: An accelerator for attribute reduction in rough set theory
Artificial Intelligence
Autonomous rule induction from data with tolerances in customer relationship management
Expert Systems with Applications: An International Journal
Using thesaurus to improve multiclass text classification
CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
A new feature selection algorithm based on binomial hypothesis testing for spam filtering
Knowledge-Based Systems
Feature selection for unlabeled data
ICSI'11 Proceedings of the Second international conference on Advances in swarm intelligence - Volume Part II
Feature selection for support vector machines with RBF kernel
Artificial Intelligence Review
Class-driven correlation learning for Chinese document categorization using discriminative features
Proceedings of the Third International Conference on Internet Multimedia Computing and Service
Multi objective genetic programming for feature construction in classification problems
LION'05 Proceedings of the 5th international conference on Learning and Intelligent Optimization
Feature evaluation and selection with cooperative game theory
Pattern Recognition
Large-margin feature selection for monotonic classification
Knowledge-Based Systems
An efficient rough feature selection algorithm with a multi-granulation view
International Journal of Approximate Reasoning
An adaption of relief for redundant feature elimination
ISNN'12 Proceedings of the 9th international conference on Advances in Neural Networks - Volume Part II
A novel probabilistic feature selection method for text classification
Knowledge-Based Systems
Fast dimension reduction for document classification based on imprecise spectrum analysis
Information Sciences: an International Journal
Feature selection using dynamic weights for classification
Knowledge-Based Systems
An accelerator for attribute reduction based on perspective of objects and attributes
Knowledge-Based Systems
Selection of interdependent genes via dynamic relevance analysis for cancer diagnosis
Journal of Biomedical Informatics
Class-indexing-based term weighting for automatic text classification
Information Sciences: an International Journal
The impact of preprocessing on text classification
Information Processing and Management: an International Journal
Most previous work on feature selection has emphasized only reducing the high dimensionality of the feature space. But in cases where many features are highly redundant with one another, other means are required, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization that does not rely on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results on a number of datasets show that our method is more effective than the greedy feature selection method of Koller and Sahami [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], and than conventional information gain, which is commonly used for feature selection in text categorization. Moreover, with our feature selection method, conventional machine learning algorithms sometimes outperform support vector machines, which are known to give the best classification accuracy.
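The abstract describes a greedy selection criterion that trades information gain against a divergence-based redundancy measure between features. The following is a rough sketch of that idea, not the authors' exact criterion: it scores each candidate feature by its information gain minus a weighted redundancy term, where redundancy is simplified here to the mutual information between feature columns and the trade-off weight `beta` is a hypothetical parameter, not taken from the paper.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def information_gain(feature_col, labels):
    """IG(C; F) = H(C) - H(C|F) for one discrete feature column."""
    h_cond = 0.0
    for v in set(feature_col):
        subset = [l for f, l in zip(feature_col, labels) if f == v]
        h_cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - h_cond

def select_features(X, labels, k, beta=1.5):
    """Greedily pick k feature indices, trading off information gain
    against redundancy with already-selected features.  Redundancy is
    measured as mutual information between feature columns (an
    illustrative simplification; beta is a hypothetical weight)."""
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    ig = [information_gain(c, labels) for c in cols]
    selected, candidates = [], list(range(n_feat))
    while len(selected) < k and candidates:
        def score(j):
            if not selected:
                return ig[j]
            # MI(f_j; f_s): reuse information_gain with a column as "labels"
            redundancy = max(information_gain(cols[j], cols[s])
                             for s in selected)
            return ig[j] - beta * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

On a toy corpus with a perfectly informative feature, an exact duplicate of it, and a weakly informative but independent feature, plain information-gain ranking keeps the duplicate, whereas the redundancy penalty makes the greedy pass skip it in favor of the independent feature.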