Combined classifier for unknown genome classification using chaos game representation features

  • Authors:
  • Vrinda V. Nair;Achuthsankar S. Nair

  • Affiliations:
  • University of Kerala, Thiruvananthapuram, Kerala;University of Kerala, Thiruvananthapuram

  • Venue:
  • ISB '10 Proceedings of the International Symposium on Biocomputing
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Classification of unknown genomes finds wide application in areas like evolutionary studies, bio-diversity researches and forensic studies which are viewed in a renewed 'genomic' perspective, lately. Only a few attempts are seen in literature focusing on unknown genome identification, and the reported accuracies are not more than 85%. Most works report classification into the major kingdoms only, not venturing further into their sub-classes. A novel combined technique of Chaos Game Representation (CGR) and machine learning is proposed, the former for feature extraction and the latter for subsequent sequence classification. Eight sub categories of eukaryotic mitochondrial genomes from NCBI are used for the study. The sequences are initially mapped into their Chaos Game Representation format. Genomic feature extraction is implemented by computing the Frequency Chaos Game Representation (FCGR) matrix. An order 3 FCGR matrix is considered here, which consists of 64 elements. The 64 element matrix acts as the feature descriptor for classification. The classification methods used are Difference Boosting Naïve Bayesian (DBNB) based method, Artificial Neural Network (ANN) based and Support Vector Machine (SVM) based methods. Accuracies of individual methods are reported. Although the average accuracy is seen highest for the SVM-CGR combination, better accuracies are seen for some categories in other methods too. Hence a voting classifier is implemented combining all the three methods. Accuracies of 100% were obtained for Vertebrata and Porifera whereas Acoelomata, Cnidaria and Fungi were classified with accuracies above 90%. The accuracies obtained for Protostomia, Plant, and Pseudocoelomata were respectively 90, 82 and 77%.