Feature selection based on term frequency and T-test for text categorization

Authors:
Deqing Wang;Hui Zhang;Rui Liu;Weifeng Lv
Affiliations:
Beihang University, Beijing, China;Beihang University, Beijing, China;Beihang University, Beijing, China;Beihang University, Beijing, China
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012



Abstract

Much work has been done on feature selection. Existing methods, such as the Chi-Square statistic and Information Gain, are based on document frequency. However, these methods have two shortcomings: they are not reliable for low-frequency terms, and they only count whether a term occurs in a document, ignoring its term frequency. In practice, terms that occur with high frequency within a specific category are often regarded as discriminators. This paper focuses on how to construct a feature selection function based on term frequency and proposes a new approach based on the t-test, which is used to measure the diversity of the distributions of a term between a specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the new approach is comparable to or slightly better than state-of-the-art feature selection methods (i.e., χ² and IG) in terms of macro-F1 and micro-F1.
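The idea of scoring a term by how far its frequency within one category departs from its frequency across the whole corpus can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: the function names `t_score` and `rank_terms`, the use of raw per-document counts, the population standard deviation as the spread estimate, and the scaling factor sqrt(1/N_k - 1/N) for a category mean compared against the mean of the corpus that contains it. It should not be read as the authors' exact method.

```python
import math
from statistics import mean, pstdev


def t_score(tf, labels, category):
    """Illustrative t-like score for one term and one category.

    tf       -- per-document frequency of the term (one number per document)
    labels   -- category label of each document, aligned with tf
    category -- the category being scored

    Assumption: the score is the absolute difference between the category
    mean and the corpus mean, divided by an estimate of its standard error.
    """
    in_cat = [f for f, y in zip(tf, labels) if y == category]
    n, n_k = len(tf), len(in_cat)
    if n_k == 0 or n_k == n:
        return 0.0  # no contrast between category and corpus
    spread = pstdev(tf)  # population standard deviation over the corpus
    if spread == 0:
        return 0.0  # the term has the same frequency in every document
    # Standard error factor for a subset mean versus the full-set mean.
    scale = math.sqrt(1.0 / n_k - 1.0 / n)
    return abs(mean(in_cat) - mean(tf)) / (spread * scale)


def rank_terms(tf_by_term, labels, category):
    """Return terms sorted from highest to lowest score for the category."""
    scores = {
        term: t_score(tf, labels, category)
        for term, tf in tf_by_term.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy corpus: four documents, two categories (hypothetical data).
labels = ["sport", "sport", "politics", "politics"]
tf_by_term = {
    "goal": [3, 4, 0, 0],  # frequent only in sport documents
    "the": [5, 5, 5, 5],   # equally frequent everywhere
    "vote": [0, 0, 2, 1],  # frequent only in politics documents
}
ranking = rank_terms(tf_by_term, labels, "sport")
```

In this toy corpus a document-frequency measure would treat "goal" the same whether it appears once or four times in a sport document, while the frequency-based score uses the counts themselves. Because the sketch takes an absolute difference, a term that is unusually rare in the category (here "vote") also scores above a term that is uniformly distributed (here "the").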