Mixed-sampling approach to unbalanced data distributions: a case study involving Leukemia's document profiling

  • Authors:
  • Wu QingQiang;Liu Hua;Liu KunHong

  • Affiliations:
  • School of Software, Xiamen University, Xiamen, Fujian Province, P. R. China;Information resource center, Institute of Scientific and Technical Information of China, Beijing, P. R. China;School of Software, Xiamen University, Xiamen, Fujian Province, P. R. China

  • Venue:
  • WSEAS Transactions on Information Science and Applications
  • Year:
  • 2011

Abstract

This paper introduces the types of leukemia and their relationships to the literature, and on that basis constructs a leukemia data set for classification from original data sources such as the Cancer Gene Census, PubMed, and gene2pubmed. The resulting data set is imbalanced, and it serves as the research object. After reviewing current classification methods for imbalanced data sets, the paper analyzes the problems that sampling raises for such data and proposes a mixed-sampling method to classify the leukemia data set. The multi-class leukemia problem is transformed into a set of two-class problems, and the area under the receiver operating characteristic (ROC) curve (AUC) is used to evaluate the mixed-sampling method. Experiments are then performed to verify the classification efficiency and stability of eight classification methods, and their results are compared. The mixed-sampling method achieves the best performance. Finally, the paper concludes with an outlook on future work.
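
The abstract does not spell out how the mixed sampling is carried out, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes that "mixed sampling" means random undersampling of the majority class combined with random oversampling of the minority class, with both classes brought to the midpoint of their original sizes. The function names (`one_vs_rest`, `mixed_sample`, `auc`), the midpoint target, and the toy labels are all hypothetical. The sketch also shows the two other steps the abstract names: reducing a multi-class problem to two-class problems, and scoring with AUC.

```python
import random
from collections import Counter


def one_vs_rest(labels, positive):
    """Relabel a multi-class label list as a two-class problem (1 = positive class)."""
    return [1 if y == positive else 0 for y in labels]


def mixed_sample(X, y, seed=0):
    """Assumed mixed sampling: undersample the majority class and oversample
    the minority class until both reach the midpoint of their original sizes.
    Both classes must be present in y."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = sorted((pos, neg), key=len)
    target = (len(minority) + len(majority)) // 2
    # Undersample the majority class without replacement.
    kept = rng.sample(majority, target)
    # Oversample the minority class by drawing duplicates with replacement.
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    idx = kept + minority + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (not from the paper): 2 documents of one type, 8 of another.
toy_labels = ["AML"] * 2 + ["ALL"] * 8
toy_docs = list(range(len(toy_labels)))      # stand-ins for document features
y_binary = one_vs_rest(toy_labels, "AML")    # "AML" versus the rest
X_bal, y_bal = mixed_sample(toy_docs, y_binary)
class_counts = Counter(y_bal)                # class sizes after resampling
```

In a real experiment, a sampler of this kind would be applied to the training split only, so that duplicated minority examples never reach the test data on which AUC is computed.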