Mixed-sampling approach to unbalanced data distributions: a case study involving Leukemia's document profiling

Authors:
Wu QingQiang;Liu Hua;Liu KunHong
Affiliations:
School of Software, Xiamen University, Xiamen, Fujian Province, P. R. China;Information resource center, Institute of Scientific and Technical Information of China, Beijing, P. R. China;School of Software, Xiamen University, Xiamen, Fujian Province, P. R. China
Venue:
WSEAS Transactions on Information Science and Applications
Year:
2011

Abstract

This paper introduces the types of leukemia and their relationships to the literature, and on that basis constructs a leukemia data set for classification from original data sources such as the Cancer Gene Census, PubMed and gene2pubmed. The resulting data set is imbalanced and serves as the research object. After reviewing current classification methods for imbalanced data sets, the paper analyzes the problems that sampling raises on such data and proposes a mixed-sampling method to classify the leukemia data set. The multi-class leukemia problem is transformed into a set of two-class problems. The area under the receiver operating characteristic (ROC) curve (AUC) is used to evaluate the mixed-sampling method. Experiments are then performed to verify the classification efficiency and stability of eight classification methods, and their results are compared. The experiments show that the mixed-sampling method achieves the best performance. Finally, the paper summarizes its work and looks ahead to future research.
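
The three steps named in the abstract (splitting the multi-class problem into two-class problems, mixed sampling, and AUC evaluation) can be illustrated with a minimal sketch. The abstract does not spell out the sampling procedure, so this sketch assumes that "mixed sampling" means randomly under-sampling the larger class and randomly over-sampling the smaller one until both reach a common size. The function names, the `target` parameter and the class labels are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def one_vs_rest(labels, positive):
    """Relabel a multi-class problem as two-class: `positive` (1) vs. the rest (0)."""
    return [1 if y == positive else 0 for y in labels]


def mixed_sample(rows, labels, target, seed=0):
    """Resample so that every class ends up with exactly `target` rows.

    Assumed scheme (not confirmed by the paper): classes with at least
    `target` rows are under-sampled without replacement; smaller classes
    keep all their rows and are over-sampled with replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)

    out_rows, out_labels = [], []
    for y, members in sorted(by_class.items()):
        if len(members) >= target:
            chosen = rng.sample(members, target)
        else:
            chosen = members + rng.choices(members, k=target - len(members))
        out_rows.extend(chosen)
        out_labels.extend([y] * target)
    return out_rows, out_labels


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative data only: three made-up classes with a skewed distribution.
labels = ["type_a"] * 2 + ["type_b"] * 3 + ["type_c"] * 15
rows = list(range(len(labels)))  # row ids stand in for feature vectors

binary = one_vs_rest(labels, "type_a")            # 2 positives, 18 negatives
new_rows, new_labels = mixed_sample(rows, binary, target=6)
print(Counter(new_labels))                        # class sizes after resampling
```

In practice the resampling would be applied to the training split only, once for each two-class problem, and the AUC would be computed on untouched test data from the scores of whichever classifier is being compared.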