Biological sequences encoding for supervised classification

  • Authors:
  • Rabie Saidi;Mondher Maddouri;Engelbert Mephu Nguifo

  • Affiliations:
  • CRIL-CNRS, Université d'Artois, IUT de Lens, France and FSJEG, University of Jendouba, Tunisia;Computer Science Department, National Institute of Applied Sciences and Technologies, Tunisia;CRIL-CNRS, Université d'Artois, IUT de Lens, France

  • Venue:
  • BIRD'07 Proceedings of the 1st international conference on Bioinformatics research and development
  • Year:
  • 2007

Abstract

The classification of biological sequences, both protein and nucleic acid, is one of the significant challenges in bioinformatics. The huge volume of these data, their ambiguity, and especially the high cost of in vitro analysis in terms of time and money make data mining a necessity rather than merely a rational choice. However, data mining techniques, which usually process data in relational format, cannot be applied directly to biological sequences, whose format is unsuitable. Hence, a pre-processing step is unavoidable. This work presents the encoding of biological sequences as a preparation step before their classification. We present three existing encoding methods based on motif extraction. We also propose an improvement to one of these methods and carry out a comparative study that considers not only the effect of each method on classification accuracy but also the number of generated attributes and the CPU time.
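To make the idea of encoding sequences into a relational format concrete, the following is a minimal sketch and not one of the three methods compared in the paper, which the abstract does not name. It assumes that a "motif" is simply any substring of fixed length k (an n-gram) and that each sequence is encoded as a binary row recording which motifs it contains. The function names `extract_motifs` and `encode`, the parameter `k`, and the toy sequences are all hypothetical.

```python
def extract_motifs(sequences, k=3):
    """Collect every distinct length-k substring across all sequences.

    Assumption: fixed-length substrings stand in for 'motifs'; the paper's
    actual extraction methods may differ.
    """
    motifs = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            motifs.add(seq[i:i + k])
    return sorted(motifs)


def encode(sequences, motifs):
    """Build a binary table: one row per sequence, one column per motif."""
    return [[1 if motif in seq else 0 for motif in motifs] for seq in sequences]


# Toy nucleic sequences, invented for illustration only.
sequences = ["ACGTAC", "CGTTGA"]
motifs = extract_motifs(sequences, k=3)
table = encode(sequences, motifs)

print(motifs)
for row in table:
    print(row)
```

Each column of the resulting table is one generated attribute, so the number of distinct motifs directly determines the attribute count that the comparative study measures alongside accuracy and CPU time.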