Distinguishing Endogenous Retroviral LTRs from SINE Elements Using Features Extracted from Evolved Side Effect Machines

Authors:
Wendy Ashlock; Suprakash Datta
Affiliations:
York University, Toronto; York University, Toronto
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2012

Abstract

Side effect machines produce features for classifiers that distinguish different types of DNA sequences. They also have the potential, as yet unexploited, to give insight into biological features of the sequences. We introduce several innovations to the production and use of side effect machine sequence features. We compare the results of training classifiers on consensus sequences and on genomic sequences, and find that genomic sequences yield more accurate results. Surprisingly, we were even able to build a classifier that distinguished consensus sequences from genomic sequences with high accuracy, suggesting that consensus sequences are not always representative of their genomic counterparts. We apply our techniques to the problem of distinguishing two types of transposable elements, solo LTRs and SINEs. Identifying these sequences is important because they affect gene expression, genome structure, and genetic diversity, and because they serve as genetic markers. The two types are of similar length, neither codes for protein, and both have many nearly identical copies throughout the genome. Being able to distinguish them efficiently and automatically will aid efforts to improve genome annotation. Our approach also reveals structural characteristics of the sequences that may be of interest to biologists.
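
The abstract does not define a side effect machine. As a rough illustration only, the sketch below assumes the description commonly given in the related literature: a finite state machine driven by the sequence, with a counter on each state, where the state visit counts form the feature vector. The transition table, the function name `sem_features`, the length normalization, and the handling of ambiguous bases are all hypothetical choices made for this example, not the machines or settings used in the paper.

```python
from typing import Dict, List

# Hypothetical 3-state machine over the DNA alphabet. This table is invented
# for illustration; it is NOT an evolved machine from the paper.
TRANSITIONS: List[Dict[str, int]] = [
    {"A": 1, "C": 0, "G": 2, "T": 0},  # transitions out of state 0
    {"A": 1, "C": 2, "G": 0, "T": 1},  # transitions out of state 1
    {"A": 0, "C": 2, "G": 2, "T": 1},  # transitions out of state 2
]


def sem_features(
    sequence: str,
    transitions: List[Dict[str, int]] = TRANSITIONS,
    start: int = 0,
) -> List[float]:
    """Run a sequence through the machine and return normalized visit counts.

    Each base moves the machine to a new state, and the counter of the state
    it lands on is incremented (the "side effect"). Dividing by the number of
    processed bases is an assumption made here so that sequences of different
    lengths give comparable vectors.
    """
    counts = [0] * len(transitions)
    state = start
    steps = 0
    for base in sequence.upper():
        if base not in transitions[state]:
            continue  # assumption: skip ambiguous symbols such as N
        state = transitions[state][base]
        counts[state] += 1
        steps += 1
    if steps == 0:
        return [0.0] * len(transitions)
    return [count / steps for count in counts]


if __name__ == "__main__":
    print(sem_features("ACGT"))
```

Under these assumptions, each sequence maps to a fixed-length numeric vector regardless of its length, which is what lets a standard classifier be trained on it. In an evolutionary setting the transition table would be the thing being searched over, scored by how well the resulting features separate the classes.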