Comparison of term frequency and document frequency based feature selection metrics in text categorization

Authors:
Nouman Azam;JingTao Yao
Affiliations:
Department of Computer Science, University of Regina, Regina, SK, Canada S4S 0A2;Department of Computer Science, University of Regina, Regina, SK, Canada S4S 0A2
Venue:
Expert Systems with Applications: An International Journal
Year:
2012

Abstract

Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized or searched. Feature selection remains an effective and efficient technique in text categorization. Feature selection metrics are commonly based on either the term frequency or the document frequency of a word. We focus on the relative importance of these two frequencies for feature selection metrics. For this purpose, the document frequency based metrics of discriminative power measure and GINI index were also examined with term frequency. The metrics were compared and analyzed on the Reuters-21578 dataset. Experimental results revealed that the term frequency based metrics may be useful, especially for smaller feature sets. Two characteristics of term frequency based metrics were observed by analyzing the scatter of features among classes and the rate at which information in the data was covered. These characteristics may contribute toward their superior performance for smaller feature sets.
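
The abstract does not give the metric definitions, so the following is only a minimal sketch of the distinction it draws. It assumes one commonly used form of the Gini index for text feature selection, Gini(t) = sum over classes c of P(t|c)^2 * P(c|t)^2, and it assumes that a term frequency based variant can be obtained by replacing document counts with occurrence counts. The function names `frequency_tables` and `gini_score` and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def frequency_tables(docs):
    """Build per-class term frequency and document frequency tables.

    docs is a list of (tokens, label) pairs.
    tf[term][label]: total occurrences of term in documents of that label.
    df[term][label]: number of documents of that label containing term.
    Also returns tokens-per-class and documents-per-class totals.
    """
    tf = defaultdict(Counter)
    df = defaultdict(Counter)
    token_totals = Counter()
    doc_totals = Counter()
    for tokens, label in docs:
        doc_totals[label] += 1
        token_totals[label] += len(tokens)
        for term, n in Counter(tokens).items():
            tf[term][label] += n  # every occurrence counts
            df[term][label] += 1  # each document counts once
    return tf, df, token_totals, doc_totals


def gini_score(term, table, class_totals):
    """Assumed Gini form: sum_c P(t|c)^2 * P(c|t)^2, estimated from counts."""
    term_total = sum(table[term].values())
    if term_total == 0:
        return 0.0
    score = 0.0
    for label, class_total in class_totals.items():
        n = table[term][label]
        p_t_given_c = n / class_total
        p_c_given_t = n / term_total
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score


# Toy data (hypothetical): "wheat" is repeated inside one grain document,
# which document frequency cannot see but term frequency can.
docs = [
    (["wheat", "wheat", "wheat", "price"], "grain"),
    (["wheat", "export"], "grain"),
    (["oil", "price", "price"], "crude"),
    (["oil", "wheat"], "crude"),
]

tf, df, token_totals, doc_totals = frequency_tables(docs)
gini_df = gini_score("wheat", df, doc_totals)
gini_tf = gini_score("wheat", tf, token_totals)
print("document frequency based Gini:", gini_df)
print("term frequency based Gini:", gini_tf)
```

The two scores differ because the document frequency table records only whether "wheat" appears in a document, while the term frequency table also records how often it appears. Ranking all terms by either score and keeping the top k would give a feature set of size k under the corresponding metric.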