A combined approach to tackle imbalanced data sets

  • Authors:
  • B. K. Sarkar; S. S. Sana; K. S. Chaudhuri

  • Affiliations:
  • Department of Information Technology, Birla Institute of Technology, Mesra, Ranchi, India; Department of Mathematics, Bhangar Mahavidyalaya C.U., Bhangar, India; Department of Mathematics, Jadavpur University, Kolkata, India

  • Venue:
  • International Journal of Hybrid Intelligent Systems
  • Year:
  • 2012

Abstract

Learning from imbalanced data leads to high error rates, and several approaches have been developed to address this problem. This paper introduces a new learning model that integrates the C4.5 classifier with evolutionary algorithms. To strengthen the model, two separate partitions are derived from each original data set by applying two distinct partitioning schemes proposed in this investigation, and the learning model uses them in sequence. More specifically, the hybrid system first applies the base method, C4.5, to produce a set of rules R from a training set T_1 constructed by the first partitioning scheme. R is then passed to a genetic algorithm, which discovers another set of rules R_{GA} from a second training set T_2 that is disjoint from T_1 and selected by the second partitioning scheme. Finally, some informative rules of R_{GA} are added to R. The system is tested on several real data sets from the UCI machine learning repository and compared with standard C4.5. Experimental results show that the system is well suited to imbalanced data sets, and that it does not degrade performance on balanced data sets either.
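The data flow described in the abstract (rules R from T_1, a genetic search seeded with R on a disjoint T_2, then merging selected rules of R_{GA} into R) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, since the abstract does not give these details: the alternating per-class split stands in for the two partitioning schemes, a single-threshold rule learner stands in for C4.5, the fitness (sensitivity times specificity) and the mutation-only search are simplified choices, and all function names, parameters, and the synthetic data are hypothetical.

```python
import random

# A rule is (feature, threshold, greater, label):
# "if x[feature] > threshold (or <= threshold when greater is False) then label".
def covers(rule, x):
    feature, threshold, greater, _ = rule
    return (x[feature] > threshold) == greater

def fitness(rule, data):
    """Sensitivity * specificity of a rule for its own label (assumed fitness)."""
    label = rule[3]
    tp = sum(1 for x, y in data if covers(rule, x) and y == label)
    fn = sum(1 for x, y in data if not covers(rule, x) and y == label)
    tn = sum(1 for x, y in data if not covers(rule, x) and y != label)
    fp = sum(1 for x, y in data if covers(rule, x) and y != label)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens * spec

def stratified_halves(data):
    """Assumed partitioning: alternate the examples of each class between T1 and T2."""
    t1, t2, seen = [], [], {}
    for x, y in data:
        k = seen.get(y, 0)
        (t1 if k % 2 == 0 else t2).append((x, y))
        seen[y] = k + 1
    return t1, t2

def base_rules(t1):
    """Stand-in for C4.5: the best single-threshold rule per class on T1."""
    n_features = len(t1[0][0])
    rules = []
    for label in sorted({y for _, y in t1}):
        candidates = [(f, x[f], g, label)
                      for x, _ in t1 for f in range(n_features) for g in (True, False)]
        rules.append(max(candidates, key=lambda r: fitness(r, t1)))
    return rules

def mutate(rule, n_features, rng):
    """Perturb the threshold and occasionally switch to another feature."""
    feature, threshold, greater, label = rule
    if rng.random() < 0.2:
        feature = rng.randrange(n_features)
    return (feature, threshold + rng.gauss(0.0, 0.5), greater, label)

def ga_rules(seed_rules, t2, rng, generations=30, pop_size=20):
    """Elitist mutation-and-selection loop seeded with R and scored on T2."""
    n_features = len(t2[0][0])
    population = list(seed_rules)
    while len(population) < pop_size:
        population.append(mutate(rng.choice(seed_rules), n_features, rng))
    for _ in range(generations):
        children = [mutate(r, n_features, rng) for r in population]
        unique = list(dict.fromkeys(population + children))  # dedupe, keep order
        unique.sort(key=lambda r: fitness(r, t2), reverse=True)
        population = unique[:pop_size]
    return population  # sorted from best to worst on T2

def merge(r, r_ga, t2, min_fitness=0.6, max_new=3):
    """Add to R the few GA rules that are new and score well on T2."""
    extra = [rule for rule in r_ga
             if rule not in r and fitness(rule, t2) >= min_fitness]
    return r + extra[:max_new]

# Synthetic imbalanced data: 90 majority (class 0) vs. 10 minority (class 1) examples.
rng = random.Random(0)
data = [((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)), 0) for _ in range(90)]
data += [((rng.gauss(2.5, 1.0), rng.gauss(0.0, 1.0)), 1) for _ in range(10)]

t1, t2 = stratified_halves(data)
R = base_rules(t1)
R_GA = ga_rules(R, t2, rng)
final_rules = merge(R, R_GA, t2)
print(len(R), "base rules,", len(final_rules) - len(R), "added from the GA")
```

Because the search is elitist and starts from R, the best rule in R_GA scores at least as well on T_2 as any rule in R under this fitness. A real implementation would use a full decision-tree learner and the partitioning schemes and rule-selection criteria defined in the paper itself.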