Feature selection for translation initiation site recognition

Authors:
Aida De Haro-García;Javier Pérez-Rodríguez;Nicolás García-Pedrajas
Affiliations:
Department of Computing and Numerical Analysis, University of Córdoba, Spain;Department of Computing and Numerical Analysis, University of Córdoba, Spain;Department of Computing and Numerical Analysis, University of Córdoba, Spain
Venue:
IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part II
Year:
2011

Abstract

Translation initiation site (TIS) recognition is one of the first steps in gene structure prediction and a common component of any gene recognition system. Many methods have been described in the literature to identify TIS in transcripts such as mRNA, EST and cDNA sequences. However, the recognition of TIS in DNA sequences is a far more challenging task, and the methods described so far for transcripts achieve poor results on DNA sequences. From the point of view of Machine Learning, this problem has two distinguishing characteristics: the classes are imbalanced and the number of features is large. In this work, we deal with the latter of these two characteristics. Using feature selection techniques, we study the relevance of the individual features used for recognizing TIS, namely the nucleotides that form the sequences. We found that the importance of each base position depends on the type of organism. The feature selection process is used to obtain a subset of features for the sequence that improves the classification accuracy of the recognizer. Our results using sequences from the human genome, Arabidopsis thaliana and Ustilago maydis show the usefulness of the proposed approach.
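
The idea of treating each nucleotide position as a feature and keeping only the most relevant positions can be illustrated with a minimal sketch. The abstract does not say which feature selection techniques were used, so the code below assumes a simple filter criterion (mutual information between the base at a position and the class label) purely for illustration. The function names (`mutual_information`, `rank_positions`, `select_features`) and the toy sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(column, labels):
    """Mutual information (in bits) between the bases at one position and the class labels."""
    n = len(labels)
    joint = Counter(zip(column, labels))
    base_counts = Counter(column)
    label_counts = Counter(labels)
    mi = 0.0
    for (base, label), count in joint.items():
        p_joint = count / n
        p_base = base_counts[base] / n
        p_label = label_counts[label] / n
        mi += p_joint * math.log2(p_joint / (p_base * p_label))
    return mi


def rank_positions(sequences, labels):
    """Score every position of equal-length windows and return (ranking, scores)."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequence windows must have the same length")
    scores = [
        mutual_information([seq[i] for seq in sequences], labels)
        for i in range(length)
    ]
    # Highest-scoring positions first; ties keep their original order.
    ranking = sorted(range(length), key=lambda i: scores[i], reverse=True)
    return ranking, scores


def select_features(sequences, positions):
    """Keep only the chosen positions of each window."""
    return ["".join(seq[i] for i in positions) for seq in sequences]


# Made-up toy windows around a candidate ATG (positions 3 to 5).
# Label 1 = true TIS, label 0 = non-TIS. Not real data.
sequences = ["ACCATGG", "AGTATGG", "TCCATGG", "TGTATGC"]
labels = [1, 1, 0, 0]

ranking, scores = rank_positions(sequences, labels)
top_positions = ranking[:2]
reduced = select_features(sequences, top_positions)
print(top_positions, reduced)
```

In this toy data the ATG itself is identical in every window, so its positions carry no information about the label, while a flanking position does. A recognizer would then be trained on the reduced windows; because the abstract reports that position relevance depends on the organism, such a ranking would presumably be computed separately for each genome.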