Exploiting probabilistic topic models to improve text categorization under class imbalance

  • Authors:
  • Enhong Chen, Yanggang Lin, Hui Xiong, Qiming Luo, Haiping Ma

  • Affiliations:
  • School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China (Enhong Chen, Yanggang Lin, Qiming Luo, Haiping Ma)
  • Department of Management Science and Information Systems, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901-8554, USA (Hui Xiong)

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2011

Abstract

In text categorization, the numbers of documents in different categories often differ substantially, i.e., the class distribution is imbalanced. We propose an approach to improve text categorization under class imbalance by exploiting the semantic context of text documents. Specifically, we generate new samples for rare classes (categories with relatively little training data) using global, class-level semantic information represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced, and categorization performance improves when classifiers are trained on the transformed data set. The proposed method differs from traditional re-sampling methods, which balance the number of documents across classes by re-sampling the documents of rare classes and can therefore cause overfitting. Another benefit of our approach is its effective handling of noisy samples: because all new samples are generated by topic models, the impact of noisy samples is greatly reduced. Finally, as demonstrated by the experimental results, the proposed method achieves better performance under class imbalance and is more tolerant of noisy samples.
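The abstract does not give implementation details, but the core idea it describes, fitting a probabilistic topic model and sampling new documents for a rare class from its class-level topic distribution, can be sketched as follows. This is a minimal Python illustration, not the authors' exact algorithm; the choice of LDA via scikit-learn, the function name generate_synthetic_docs, and parameters such as n_topics and doc_len are assumptions made only for the sketch.

    # Minimal sketch (assumed details, not the paper's exact method):
    # fit an LDA topic model, estimate a class-level topic distribution
    # for a rare class, and sample new bag-of-words documents from it.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def generate_synthetic_docs(rare_docs, n_new_docs, n_topics=20, doc_len=100, seed=0):
        """Sample synthetic bag-of-words documents for a rare class from an LDA model."""
        rng = np.random.default_rng(seed)
        vectorizer = CountVectorizer(stop_words="english")
        X = vectorizer.fit_transform(rare_docs)          # document-term count matrix
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed).fit(X)

        # Class-level ("global") topic distribution: average over the rare-class documents.
        class_topics = lda.transform(X).mean(axis=0)
        class_topics /= class_topics.sum()

        # Normalize per-topic word pseudo-counts into word distributions.
        topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

        vocab = vectorizer.get_feature_names_out()
        new_docs = []
        for _ in range(n_new_docs):
            words = []
            for _ in range(doc_len):
                z = rng.choice(n_topics, p=class_topics)      # draw a topic
                w = rng.choice(len(vocab), p=topic_word[z])   # draw a word from that topic
                words.append(vocab[w])
            new_docs.append(" ".join(words))
        return new_docs

Under these assumptions, the generated documents would be appended to the rare class's training data before training the classifier, which corresponds to the re-balancing step the abstract describes.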