Class imbalance methods for translation initiation site recognition in DNA sequences

  • Authors:
  • Nicolás García-Pedrajas;Javier Pérez-Rodríguez;María García-Pedrajas;Domingo Ortiz-Boyer;Colin Fyfe

  • Affiliations:
  • Department of Computing and Numerical Analysis of the University of Córdoba, Campus Universitario de Rabanales, 14071 Córdoba, Spain;Department of Computing and Numerical Analysis of the University of Córdoba, Campus Universitario de Rabanales, 14071 Córdoba, Spain;Instituto de Hortofruticultura Subtropical y Mediterránea "La Mayora", Universidad de Málaga-Consejo Superior de Investigaciones Científicas (IHSM-UMA-CSIC), Estación Experimen ...;Department of Computing and Numerical Analysis of the University of Córdoba, Campus Universitario de Rabanales, 14071 Córdoba, Spain;School of Computing, University of the West of Scotland, Paisley PA1 2BE, United Kingdom

  • Venue:
  • Knowledge-Based Systems
  • Year:
  • 2012

Abstract

Translation initiation site (TIS) recognition is one of the first steps in gene structure prediction, and a common component of any gene recognition system. Many methods have been described in the literature to identify TIS in transcribed sequences such as mRNA, EST and cDNA sequences. However, the recognition of TIS in DNA sequences is a far more challenging task, and the methods described so far for transcripts achieve poor results in DNA sequences. Most methods approach this problem by taking its biological characteristics into account. In this work we take a different view and consider this classification problem from a purely machine learning perspective. From that point of view, TIS recognition is a class imbalance problem. Thus, in this paper we approach TIS recognition from this angle and apply the different methods that have been developed to deal with imbalanced datasets. The proposed approach has two advantages. Firstly, it improves the results obtained with standard classification methods. Secondly, it broadens the set of classification algorithms that can be used, because some class imbalance methods, such as undersampling, reduce the size of the dataset and are therefore also useful for scaling up data mining algorithms. In this way, classifiers that cannot be applied to the whole dataset, due to long training time or large memory requirements, can be used when an undersampling method is applied. Results show an advantage of class imbalance methods over the same classifiers applied without considering the imbalanced nature of the problem. The applied methods also improve on the results obtained with the best method in the literature, which is based on looking for the next in-frame stop codon from the putative TIS that must be predicted.
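
The undersampling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes plain random undersampling of the majority (non-TIS) class. The function name `random_undersample`, its parameters, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_undersample(samples, labels, majority_label=0, ratio=1.0, seed=0):
    """Keep every minority sample and a random subset of majority samples.

    ratio is the desired number of majority samples per minority sample
    (hypothetical parameter; 1.0 yields a balanced dataset).
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]

    # Never ask for more majority samples than are available.
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))

    # Sort the kept indices so the original sample order is preserved.
    kept = sorted(minority_idx + rng.sample(majority_idx, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 100 candidate sites, of which 5 are true TIS (label 1).
windows = [f"seq{i}" for i in range(100)]
labels = [1 if i % 20 == 0 else 0 for i in range(100)]

X_bal, y_bal = random_undersample(windows, labels)
counts = Counter(y_bal)
print(len(X_bal), counts[1], counts[0])
```

Because the reduced set is much smaller than the original, it can also be passed to classifiers whose training time or memory use would be prohibitive on the full data, which is the second advantage the abstract describes.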