Multi-stage redundancy reduction: effective utilisation of small protein data sets

  • Authors:
  • John Hawkins; Mikael Bodén

  • Affiliation:
  • The University of Queensland, QLD, Australia

  • Venue:
  • WISB '06 Proceedings of the 2006 workshop on Intelligent systems for bioinformatics - Volume 73
  • Year:
  • 2006

Abstract

In many important bioinformatics problems the data sets contain considerable redundancy, due both to the evolutionary processes that generate the data and to biases in the data collection procedures. The standard practice in bioinformatics is to remove this redundancy so that no two sequences in a data set share more than forty percent similarity. For small data sets this can dilute the already impoverished data beyond the boundary of practicality. One can instead include all available data by ensuring only that the training and test samples maintain the required redundancy gap between them. However, this exposes the model to a highly redundant training set and so encourages overfitting. We outline a process of multi-stage redundancy reduction whereby scarce data can be effectively utilised without compromising the integrity of the model or of the testing procedure.
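The "standard practice" the abstract refers to is typically a Hobohm-style greedy selection: keep a sequence only if it falls below the similarity threshold against every sequence already kept. A minimal sketch, assuming a crude `difflib` ratio as a stand-in for alignment-based percent identity (real pipelines would use tools such as BLAST or CD-HIT), and illustrating the single-stage reduction rather than the paper's multi-stage procedure:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude similarity proxy for illustration only; real redundancy
    reduction uses alignment-based percent identity."""
    return SequenceMatcher(None, a, b).ratio()

def greedy_reduce(seqs, threshold=0.40):
    """Hobohm-style greedy selection: a sequence is kept only if it is
    below the similarity threshold to every sequence already kept."""
    kept = []
    for s in seqs:
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

# Hypothetical toy sequences: the near-duplicate and the exact
# duplicate of the first sequence are both filtered out.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGPLVWTSA", "MKTAYIAKQR"]
reduced = greedy_reduce(seqs)
```

Note that the result of greedy selection depends on input order; production tools usually sort sequences (e.g. by length or quality) before filtering so that the most informative representative of each cluster is retained.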