Diversified ensemble classifiers for highly imbalanced data learning and its application in bioinformatics

  • Authors:
  • Yan-Qing Zhang;Zejin Ding

  • Affiliations:
  • Georgia State University;Georgia State University

  • Venue:
  • Diversified ensemble classifiers for highly imbalanced data learning and its application in bioinformatics
  • Year:
  • 2011

Abstract

With the rapid development of science and technology, massive data sets are being generated at an exponential rate. In recent years, many supervised classification methods have shown good performance on balanced data; however, mining imbalanced data remains a new and challenging long-term research area.

In this dissertation, we study the problem of how to build an efficient ensemble classifier that learns from imbalanced data sets. A formal definition of the imbalanced binary classification problem is proposed, and several challenging aspects of learning from imbalanced data are discussed. We extensively investigate current research trends in handling imbalanced learning problems to provide a comprehensive overview of representative studies in this area.

The main contribution of this work is a new ensemble framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), which builds on the advantages of several existing ensemble strategies for imbalanced learning. Our strategy combines three popular learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by oppositional data re-labeling. As a meta-learner, DECIDL can use general supervised learning algorithms, such as support vector machines, decision trees, and neural networks, as the base learner to build an effective ensemble committee.

A comprehensive benchmark pool is developed, comprising 30 public imbalanced data sets with diverse data characteristics from multiple real applications. All the data sets are highly skewed, with imbalance ratios ranging from 10:1 to 100:1, and have never been completely and systematically studied in any prior work. We compare the DECIDL framework with several existing ensemble learning frameworks, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost, on this benchmark data pool. Extensive experiments suggest that the DECIDL framework is comparable to the other ensemble methods in terms of averaged F-measure and MCC performance on the 30 data sets with four base learners (decision stumps, decision trees, linear support vector machines, and perceptron neural networks). The data sets, experiments, and results provide a complete and valuable knowledge base for future research on highly imbalanced data learning. Additional experiments are also conducted to verify the effectiveness of DECIDL under various technical settings.

INDEX WORDS: Machine learning, Classification, Imbalanced data learning, Diversified ensemble, Bioinformatics, Protein methylation
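
The abstract reports results as averaged F-measure and MCC (Matthews correlation coefficient). As a minimal sketch of why these metrics are commonly preferred to plain accuracy on skewed data, the snippet below computes both from a binary confusion matrix using their standard textbook definitions. The function names and the toy 10:1 data are illustrative assumptions, not taken from the dissertation.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN, treating `positive` as the minority class label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (0.0 if the denominator is zero)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy data with a 10:1 imbalance ratio (hypothetical, for illustration only).
y_true = [1] * 10 + [0] * 100

# A trivial classifier that always predicts the majority class.
y_majority = [0] * 110
tp, fp, fn, tn = confusion_counts(y_true, y_majority)
accuracy = (tp + tn) / len(y_true)
print(round(accuracy, 3), f_measure(tp, fp, fn), mcc(tp, fp, fn, tn))
```

On this toy data the always-majority classifier is correct on 100 of 110 examples, yet both F-measure and MCC are zero because it never identifies a minority example.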