Feature selection and classification of high dimensional mass spectrometry data: a genetic programming approach

  • Authors:
  • Soha Ahmed; Mengjie Zhang; Lifeng Peng

  • Affiliations:
  • School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand; School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand; School of Biological Sciences, Victoria University of Wellington, Wellington, New Zealand

  • Venue:
  • EvoBIO'13 Proceedings of the 11th European conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics
  • Year:
  • 2013

Abstract

Biomarker discovery from mass spectrometry (MS) data is valuable for disease detection and drug discovery. Because MS data typically contain a very large number of features (e.g. thousands) but comparatively few samples, biomarker discovery must begin with feature selection. In this study, we propose using genetic programming (GP) for automatic feature selection and classification of MS data. The GP-based approach takes the features selected by two feature selection metrics, information gain (IG) and Relief-F (REFS-F), as its terminal set. Its feature selection performance is compared with that of IG and REFS-F alone on five MS data sets with different numbers of features and instances. Naive Bayes (NB), support vector machines (SVMs) and J48 decision trees (J48) are used to evaluate the classification accuracy of the selected features. GP is also used as a classification method, and its performance is compared with that of NB, SVMs and J48. The results show that GP as a feature selection method selects fewer features and achieves better classification performance than IG and REFS-F when evaluated with NB, SVMs and J48. As a classification method, GP also outperforms NB and J48 and achieves comparable or slightly better performance than SVMs on these data sets.
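
To make the filter step concrete, the following is a minimal sketch of how information gain might be used to rank features and pick a reduced set that could then serve as GP terminals. It is not the authors' implementation: the abstract does not say how continuous intensities are discretized, so the median split used here is an assumption, and the names entropy, information_gain and top_k_features, as well as the toy data, are hypothetical. Relief-F and the GP search itself are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG of one continuous feature after a binary split at its median.

    The median split is an assumption made for this sketch only.
    """
    ordered = sorted(values)
    threshold = ordered[len(ordered) // 2]
    low = [y for v, y in zip(values, labels) if v < threshold]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    # Weighted entropy of the two partitions (empty partitions are skipped).
    remainder = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - remainder


def top_k_features(X, y, k):
    """Return the indices of the k features with the highest IG, best first."""
    n_features = len(X[0])
    scores = [information_gain([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy data: 6 samples, 3 intensity features, 2 classes.
X = [
    [0.1, 5.0, 1.0],
    [0.2, 4.0, 9.0],
    [0.3, 6.0, 2.0],
    [0.9, 5.5, 8.0],
    [0.8, 4.5, 1.5],
    [0.7, 5.2, 8.5],
]
y = [0, 0, 0, 1, 1, 1]

# Indices of the selected features; in a GP setting these would be
# the candidate terminals instead of all original features.
terminal_set = top_k_features(X, y, k=2)
print(terminal_set)
```

In this toy example only the first feature separates the two classes perfectly, so it is ranked first. With real MS data, k would be far smaller than the thousands of original features, which shrinks the search space GP has to explore.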