Comparison of metrics for feature selection in imbalanced text classification

  • Authors:
  • Hiroshi Ogura;Hiromi Amano;Masato Kondo

  • Affiliations:
  • Department of Information Science, Faculty of Arts and Sciences at Fujiyoshida, Showa University, 4562, Kamiyoshida, Fujiyoshida-City, Yamanashi 403-0005, Japan (all three authors)

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2011

Quantified Score

Hi-index 12.05

Abstract

Class imbalance problems are often encountered in real applications of automatic text classification, especially in so-called "one-against-all" settings, so handling them with satisfactory performance is important. In this paper, we focus on feature selection as a means of addressing this problem and explore the abilities and characteristics of various feature selection metrics. We examine three types of metrics: Type-I ($\chi_P^2$ and Gini index), Type-II ($\chi^2$ and information gain), and Type-III (signed $\chi^2$ and signed information gain). Type-I and Type-II metrics implicitly combine positive and negative features, which indicate membership and nonmembership in the positive class, respectively. Type-III metrics are used in a combination framework in which positive and negative features are explicitly combined and the degree of combination is optimized to improve performance on imbalanced data. Our experimental results show that feature selection using Type-I metrics on imbalanced data sets achieves classification performance comparable to that of the combination framework using Type-III metrics and far superior to that of Type-II metrics. This result indicates that Type-I metrics can serve as simpler alternatives to the combination framework. The characteristic behavior and performance of each metric are also investigated closely in terms of the distribution and quality of the selected features.
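
For readers unfamiliar with the Type-II and Type-III metrics named above, the following is a minimal sketch of the textbook forms of $\chi^2$, information gain, and their signed variants, computed from a 2x2 term/class contingency table. It is an illustration under stated assumptions, not the authors' implementation: the function names, the count names `a`, `b`, `c`, `d`, the sign convention (sign of `a*d - b*c`), and the sample counts are all assumptions introduced here. The Type-I metrics are omitted because the abstract does not define them.

```python
import math

# Assumed notation for one term and one positive class (not from the paper):
#   a = positive documents containing the term
#   b = negative documents containing the term
#   c = positive documents not containing the term
#   d = negative documents not containing the term


def chi_square(a, b, c, d):
    """Textbook chi-square statistic for a 2x2 term/class table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - b * c) ** 2 / denom


def information_gain(a, b, c, d):
    """Information gain in bits, written as the mutual information
    between term presence and class membership."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),  # term present, positive class
        (b, a + b, b + d),  # term present, negative class
        (c, c + d, a + c),  # term absent, positive class
        (d, c + d, b + d),  # term absent, negative class
    )
    ig = 0.0
    for joint, term_total, class_total in cells:
        if joint > 0:  # empty cells contribute nothing
            ig += (joint / n) * math.log2(joint * n / (term_total * class_total))
    return ig


def signed(metric, a, b, c, d):
    """Signed variant (assumed convention): positive when the term is
    associated with the positive class, negative otherwise."""
    value = metric(a, b, c, d)
    return value if a * d - b * c >= 0 else -value


if __name__ == "__main__":
    # Hypothetical imbalanced corpus: 100 positive and 900 negative documents.
    positive_feature = (40, 10, 60, 890)  # term mostly in positive documents
    negative_feature = (2, 400, 98, 500)  # term mostly in negative documents
    for counts in (positive_feature, negative_feature):
        print(counts,
              signed(chi_square, *counts),
              signed(information_gain, *counts))
```

The unsigned scores rank positive and negative features on a single scale, which is the implicit combination the abstract describes. The signed scores keep the two groups apart so that a chosen number of features can be taken from each.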