Class imbalance methods for translation initiation site recognition in DNA sequences

Authors:
Nicolás García-Pedrajas;Javier Pérez-Rodríguez;María García-Pedrajas;Domingo Ortiz-Boyer;Colin Fyfe
Affiliations:
Department of Computing and Numerical Analysis of the University of Córdoba, Campus Universitario de Rabanales, 14071 Córdoba, Spain;Department of Computing and Numerical Analysis of the University of Córdoba, Campus Universitario de Rabanales, 14071 Córdoba, Spain;Instituto de Hortofruticultura Subtropical y Mediterránea "La Mayora", Universidad de Málaga-Consejo Superior de Investigaciones Científicas (IHSM-UMA-CSIC), Estación Experimen ...;Department of Computing and Numerical Analysis of the University of Córdoba, Campus Universitario de Rabanales, 14071 Córdoba, Spain;School of Computing, University of the West of Scotland, Paisley PA1 2BE, United Kingdom
Venue:
Knowledge-Based Systems
Year:
2012

Abstract

Translation initiation site (TIS) recognition is one of the first steps in gene structure prediction, and one of the common components of any gene recognition system. Many methods have been described in the literature to identify TIS in transcribed sequences such as mRNA, EST and cDNA sequences. However, recognizing TIS in DNA sequences is a far more challenging task, and the methods described so far for transcripts achieve poor results on DNA sequences. Most methods approach this problem by taking its biological characteristics into account. In this work we take a different view and consider this classification problem from a purely machine learning perspective. From that perspective, TIS recognition is a class imbalance problem, so we apply the different methods that have been developed to deal with imbalanced datasets. The proposed approach has two advantages. First, it improves the results obtained with standard classification methods. Second, it broadens the set of classification algorithms that can be used: some class imbalance methods, such as undersampling, reduce the size of the dataset and are therefore also useful for scaling up data mining algorithms. In this way, classifiers that cannot be applied to the whole dataset, because of long training times or large memory requirements, can be used once an undersampling method has been applied. Results show an advantage of class imbalance methods over the same classification methods applied without considering the class imbalance nature of the problem. The applied methods also improve on the results obtained with the best method in the literature, which is based on looking for the next in-frame stop codon from the putative TIS that must be predicted.
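
As a rough illustration of why undersampling both rebalances the classes and shrinks the training set, the following minimal sketch performs plain random undersampling on a binary dataset. It is not the authors' implementation, and the abstract does not say which undersampling variant was used; the function name `random_undersample`, the `ratio` parameter and the toy data are assumptions made only for this example.

```python
import random
from collections import Counter


def random_undersample(samples, labels, ratio=1.0, seed=0):
    """Keep every minority instance and a random subset of the majority.

    `ratio` is the number of majority instances kept per minority
    instance (hypothetical parameter, chosen for this sketch).
    Assumes a binary problem: everything that is not the rarest
    label is treated as the majority class.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)  # rarest label
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Never ask for more majority instances than actually exist.
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept = sorted(minority_idx + rng.sample(majority_idx, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 5 true TIS (label 1) among 95 non-TIS candidates (label 0).
X = list(range(100))
y = [1] * 5 + [0] * 95
X_small, y_small = random_undersample(X, y, ratio=1.0)
print(len(X_small), Counter(y_small))
```

With `ratio=1.0` the reduced set holds as many negatives as positives, so a classifier that is too costly to train on all candidates only has to process a small fraction of them.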