A Bayesian feature selection paradigm for text classification

  • Authors:
  • Guozhong Feng; Jianhua Guo; Bing-Yi Jing; Lizhu Hao

  • Affiliations:
  • Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun 130024, Jilin Province, China and School of Mathematics and Statistics, Northeast Normal University, Changchun ...; Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun 130024, Jilin Province, China and School of Mathematics and Statistics, Northeast Normal University, Changchun ...; Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong; Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun 130024, Jilin Province, China and School of Mathematics and Statistics, Northeast Normal University, Changchun ...

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2012


Abstract

The automated classification of texts into predefined categories has attracted growing interest, owing to the increased availability of documents in digital form and the ensuing need to organize them. An important problem in text classification is feature selection, whose goal is to improve classification effectiveness, computational efficiency, or both. Because of category imbalance and feature sparsity in social text collections, filter methods may perform poorly. In this paper, we perform feature selection within the training process, automatically selecting the best feature subset by learning the characteristics of the categories from a set of preclassified documents. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
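
For intuition only, the sketch below (Python/NumPy) shows what a Metropolis search over a binary feature inclusion/exclusion vector can look like in general. It is not the authors' implementation: the scoring function, the single-bit-flip proposal, the sparsity penalty, and all names here are illustrative assumptions standing in for the paper's generative model and its marginal likelihood.

```python
# Minimal sketch: Metropolis search over a binary inclusion vector (assumed form).
import numpy as np

def log_score(gamma, X, y):
    """Placeholder objective for the inclusion vector `gamma`.

    In the paper this role is played by the marginal likelihood of the
    generative model restricted to the selected features; here we use a
    Bernoulli naive Bayes log-likelihood on the selected columns of a
    binary document-term matrix X, purely for illustration.
    """
    selected = np.flatnonzero(gamma)
    if selected.size == 0:
        return -np.inf
    ll = 0.0
    for c in np.unique(y):
        Xc = X[y == c][:, selected]
        theta = (Xc.sum(axis=0) + 1.0) / (Xc.shape[0] + 2.0)  # Laplace smoothing
        ll += (Xc * np.log(theta) + (1 - Xc) * np.log(1 - theta)).sum()
    # Mild penalty on subset size, mimicking a sparsity-inducing prior.
    return ll - 0.5 * selected.size * np.log(X.shape[0])

def metropolis_feature_search(X, y, n_iter=2000, seed=None):
    """Single-bit-flip Metropolis search over the inclusion vector."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    gamma = rng.integers(0, 2, size=p)           # random initial subset
    current = log_score(gamma, X, y)
    best, best_score = gamma.copy(), current
    for _ in range(n_iter):
        j = rng.integers(p)                      # propose flipping one feature
        proposal = gamma.copy()
        proposal[j] ^= 1
        prop_score = log_score(proposal, X, y)
        # Symmetric proposal, so accept with probability min(1, exp(score diff)).
        if np.log(rng.random()) < prop_score - current:
            gamma, current = proposal, prop_score
            if current > best_score:
                best, best_score = gamma.copy(), current
    return best
```

The single-bit-flip proposal keeps each step cheap, since only one feature's contribution changes; the paper's actual sampler and scoring are defined by its generative model rather than this stand-in.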