Biological sequences encoding for supervised classification

Authors:
Rabie Saidi;Mondher Maddouri;Engelbert Mephu Nguifo
Affiliations:
CRIL-CNRS, Université d'Artois, IUT de Lens, France and FSJEG, University of Jendouba, Tunisia;Computer Science Department, National Institute of Applied Sciences and Technologies, Tunisia;CRIL-CNRS, Université d'Artois, IUT de Lens, France
Venue:
BIRD'07 Proceedings of the 1st international conference on Bioinformatics research and development
Year:
2007

Citing 5
Cited 1

Data mining: concepts and techniques

Color Set Size Problem with Application to String Matching

CPM '92 Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching
Rapid identification of repeated patterns in strings, trees and arrays

STOC '72 Proceedings of the fourth annual ACM symposium on Theory of computing
Encoding of primary structures of biological macromolecules within a data mining perspective

Journal of Computer Science and Technology - Special issue on bioinformatics
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)


Comparing graph-based representations of protein for mining purposes

Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics

Abstract

The classification of biological sequences is one of the significant challenges in bioinformatics, for protein sequences as well as for nucleic acid sequences. The huge volume of these data, their ambiguity, and especially the high cost of in vitro analysis in terms of time and money make the use of data mining a necessity rather than merely a rational choice. However, data mining techniques, which often process data in a relational format, are confronted with the unsuitable format of biological sequences. Hence, a pre-processing step is unavoidable. This work presents the encoding of biological sequences as a preparation step before their classification. We present three existing encoding methods based on motif extraction. We also propose an improvement to one of these methods, and we carry out a comparative study that takes into account the effect of each method on classification accuracy as well as the number of generated attributes and the CPU time.
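
As a minimal sketch of what encoding sequences into a relational format can look like, the Python example below assumes that motifs are simply fixed-length substrings (n-grams) and that each attribute records the presence or absence of one motif. These choices, the function names `extract_ngrams` and `encode_sequences`, and the toy sequences are illustrative assumptions; they are not claimed to be the methods compared in the paper.

```python
from typing import List, Set, Tuple


def extract_ngrams(sequence: str, n: int) -> Set[str]:
    """Return the set of all length-n substrings (assumed motifs) of a sequence."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def encode_sequences(sequences: List[str], n: int = 3) -> Tuple[List[str], List[List[int]]]:
    """Encode sequences as a binary table.

    Each row corresponds to one sequence and each column to one motif;
    a cell is 1 if the motif occurs in the sequence and 0 otherwise.
    """
    per_sequence = [extract_ngrams(s, n) for s in sequences]
    # The attribute set is the sorted union of all motifs seen in any sequence.
    motifs = sorted(set().union(*per_sequence))
    table = [[1 if m in grams else 0 for m in motifs] for grams in per_sequence]
    return motifs, table


# Hypothetical toy protein fragments, used only to illustrate the encoding.
sequences = ["MKVLA", "KVLAG", "GGMKV"]
motifs, table = encode_sequences(sequences, n=3)

print(motifs)
for row in table:
    print(row)
```

The resulting table has one column per distinct motif, so the number of generated attributes grows with the motif vocabulary. That is one reason the attribute count and CPU time are worth comparing alongside classification accuracy.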