A learning method for the class imbalance problem with medical data sets

Authors:
Der-Chiang Li; Chiao-Wen Liu; Susan C. Hu
Affiliations:
Department of Industrial and Information Management, National Cheng Kung University, 1, University Road, Tainan, Taiwan 70101, ROC (Der-Chiang Li, Chiao-Wen Liu)
Department of Public Health, College of Medicine, National Cheng Kung University, 1, University Road, Tainan, Taiwan 70101, ROC (Susan C. Hu)
Venue:
Computers in Biology and Medicine
Year:
2010

Abstract

Medical data sets consist predominantly of "normal" samples, with only a small percentage of "abnormal" ones, which leads to the so-called class imbalance problem. When all of the data are fed to a classifier to build the learning model, the model is usually biased toward the majority class. To deal with this, this paper uses a strategy that over-samples the minority class and under-samples the majority class to balance the data sets. For the majority class, this paper builds a Gaussian-type fuzzy membership function and applies an α-cut to reduce the data size; for the minority class, it uses the mega-trend diffusion membership function to generate virtual samples. Furthermore, after balancing the class sizes, this paper extends the data attributes into a higher-dimensional space using classification-related information to enhance classification accuracy. Two medical data sets, Pima Indians' diabetes and BUPA liver disorders, are employed to illustrate the approach. The results indicate that the proposed method achieves better classification performance than SVM, the C4.5 decision tree, and the methods of two other studies.
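
The majority-class reduction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact membership formula, so the code below assumes a per-attribute Gaussian membership centred on the class mean with the class standard deviation as its width, and it assumes the per-attribute values are combined by taking their minimum. The function names `gaussian_membership` and `alpha_cut_undersample` and the default `alpha=0.5` are hypothetical. The mega-trend diffusion over-sampling and the attribute extension steps are not shown.

```python
import numpy as np


def gaussian_membership(X):
    """Membership in (0, 1] of each row of X with respect to its own class.

    Assumption: exp(-(x - mean)^2 / (2 * std^2)) per attribute, combined
    across attributes with the minimum (a common fuzzy AND).
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # A constant attribute has zero spread; every value equals the mean,
    # so any positive width gives membership 1 and avoids division by zero.
    sigma[sigma == 0] = 1.0
    per_attribute = np.exp(-((X - mu) ** 2) / (2.0 * sigma ** 2))
    return per_attribute.min(axis=1)


def alpha_cut_undersample(X_major, alpha=0.5):
    """Keep only majority-class samples whose membership is at least alpha."""
    X_major = np.asarray(X_major, dtype=float)
    keep = gaussian_membership(X_major) >= alpha
    return X_major[keep]


# Usage on synthetic data (not the Pima or BUPA data sets).
rng = np.random.default_rng(0)
X_major = rng.normal(size=(500, 2))
X_reduced = alpha_cut_undersample(X_major, alpha=0.5)
print(len(X_major), "->", len(X_reduced))
```

Raising `alpha` keeps only samples close to the centre of the majority class and so removes more data; under these assumptions, `alpha=0` keeps every sample.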