Entropy based feature selection for text categorization

  • Authors:
  • Christine Largeron; Christophe Moulin; Mathias Géry

  • Affiliations:
  • Université de Lyon, Saint-Étienne, France (all authors)

  • Venue:
  • Proceedings of the 2011 ACM Symposium on Applied Computing
  • Year:
  • 2011

Abstract

In text categorization, feature selection can be essential not only for reducing the index size but also for improving the performance of the classifier. In this article, we propose a feature selection criterion, called Entropy based Category Coverage Difference (ECCD). This criterion is based on the distribution, across categories, of the documents containing a term, while also taking into account the entropy of that distribution. ECCD compares favorably with the usual feature selection methods based on document frequency (DF), information gain (IG), mutual information (MI), χ2, odds ratio and GSS on a large collection of XML documents from the Wikipedia encyclopedia. Moreover, this comparative study confirms the effectiveness of feature selection techniques derived from the χ2 statistic.
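
The sketch below illustrates the entropy ingredient the abstract refers to: the entropy of a term's document distribution over categories, where low entropy signals a term concentrated in few categories and hence likely discriminative. This is a minimal illustration of that idea, not the authors' exact ECCD formula (which is not given in the abstract); the example corpus counts are hypothetical.

```python
import math

def category_entropy(doc_counts_per_category):
    """Entropy (in bits) of the distribution, over categories, of the
    documents containing a given term. Low entropy means the term is
    concentrated in few categories; high entropy means it is spread
    evenly and is likely a poor discriminator."""
    total = sum(doc_counts_per_category)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in doc_counts_per_category:
        if count > 0:
            p = count / total          # fraction of the term's documents in this category
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical counts: documents containing the term, per category.
counts = {
    "football": [95, 3, 2],    # concentrated in one category: low entropy
    "the":      [40, 35, 38],  # spread across categories: high entropy
}
for term, per_cat in counts.items():
    print(f"{term}: H = {category_entropy(per_cat):.3f} bits")
```

Running this prints a much lower entropy for "football" than for "the"; a selection criterion that penalizes high-entropy terms would rank "football" as the better feature, which matches the intuition behind combining category coverage with entropy.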