Neural Networks for Full-Scale Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition

Authors:
Cathy Wu;Michael Berry;Sailaja Shivakumar;Jerry McLarty
Affiliations:
Department of Epidemiology/Biomathematics, The University of Texas Health Center at Tyler, Tyler, Texas 75710. wu@jason.uthct.edu
Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996-1301
Department of Epidemiology/Biomathematics, The University of Texas Health Center at Tyler, Tyler, Texas 75710
Department of Epidemiology/Biomathematics, The University of Texas Health Center at Tyler, Tyler, Texas 75710
Venue:
Machine Learning - Special issue on applications in molecular biology
Year:
1995



Abstract

A neural network classification method has been developed as an alternative approach to the search/organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward networks trained by back-propagation. Protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures the semantics of n-gram words, has improved the generalization capability of the network. A full-scale protein classification system has been implemented on a Cray supercomputer to classify unknown sequences into 3311 PIR (Protein Identification Resource) superfamilies/families at a speed of less than 0.05 CPU seconds per sequence. The sensitivity is close to 90% overall and approaches 100% for large superfamilies. The system could be used to reduce database search time and is being used to help organize the PIR protein sequence database.
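
The encoding pipeline described in the abstract (n-gram counting followed by SVD compression) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the alphabets, n-gram sizes, scaling, or SVD solver used, so the 20-letter amino acid alphabet, the bigram size, the toy sequences, the reduced dimension k=2, and the helper names `ngram_index`, `ngram_counts`, and `svd_reduce` are all assumptions made for illustration.

```python
import itertools
import numpy as np

# Assumed alphabet: the 20 standard amino acid letters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ngram_index(n, alphabet=AMINO_ACIDS):
    """Map every possible n-gram over the alphabet to a vector position."""
    return {"".join(p): i
            for i, p in enumerate(itertools.product(alphabet, repeat=n))}

def ngram_counts(seq, n, index):
    """Count overlapping n-grams in seq; n-grams not in the index are skipped."""
    vec = np.zeros(len(index))
    for i in range(len(seq) - n + 1):
        j = index.get(seq[i:i + n])
        if j is not None:
            vec[j] += 1
    return vec

def svd_reduce(X, k):
    """Project the rows of X (sequences x n-grams) onto the top-k right
    singular vectors, returning the reduced vectors and the basis."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:k].T              # shape: (number of n-grams, k)
    return X @ basis, basis

# Toy training sequences, made up for this example.
seqs = ["MKVLAAGIVGL", "MKVLGAGIVAL", "GSHMSTNPKPQR", "GSHMSTNPQPKR"]
index = ngram_index(2)                       # 20**2 = 400 possible bigrams
X = np.array([ngram_counts(s, 2, index) for s in seqs])

# Compress the long, sparse count vectors to k dimensions.
Z, basis = svd_reduce(X, k=2)

# An unseen sequence is encoded with the same basis before classification.
new = ngram_counts("MKVLAAGIVAL", 2, index) @ basis
```

In a system like the one described, the reduced vectors (`Z` and `new` here) would take the place of the raw count vectors as inputs to the neural network; the classifier itself is omitted from this sketch.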