Neural Networks for Full-Scale Protein Sequence Classification: Sequence Encoding with Singular Value Decomposition

  • Authors:
  • Cathy Wu; Michael Berry; Sailaja Shivakumar; Jerry McLarty

  • Affiliations:
  • Department of Epidemiology/Biomathematics, The University of Texas Health Center at Tyler, Tyler, Texas 75710. wu@jason.uthct.edu; Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996-1301; Department of Epidemiology/Biomathematics, The University of Texas Health Center at Tyler, Tyler, Texas 75710; Department of Epidemiology/Biomathematics, The University of Texas Health Center at Tyler, Tyler, Texas 75710

  • Venue:
  • Machine Learning - Special issue on applications in molecular biology
  • Year:
  • 1995

Abstract

A neural network classification method has been developed as an alternative approach to the search/organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward, back-propagation networks. The protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures semantics of n-gram words, has improved the generalization capability of the network. A full-scale protein classification system has been implemented on a Cray supercomputer to classify unknown sequences into 3311 PIR (Protein Identification Resource) superfamilies/families at a speed of less than 0.05 CPU second per sequence. The sensitivity is close to 90% overall, and approaches 100% for large superfamilies. The system could be used to reduce the database search time and is being used to help organize the PIR protein sequence database.
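
The encoding pipeline described in the abstract (n-gram counts followed by SVD compression) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the n-gram length, the alphabet, the hashing scheme, how the matrix is built or scaled, or how many singular vectors are kept. The sketch below assumes bigrams over the 20 standard amino acids, a direct dictionary lookup in place of the hash function, a term-by-sequence count matrix, and a projection onto the top k left singular vectors. All function names (ngram_index, encode, fit_svd, project) and the toy sequences are hypothetical.

```python
from itertools import product

import numpy as np

# Assumption: the 20 standard amino acids form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_index(n, alphabet=AMINO_ACIDS):
    """Map every possible n-gram over the alphabet to a vector position."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=n))}


def encode(sequence, index, n):
    """Count n-gram occurrences; n-grams with unknown residues are skipped."""
    vec = np.zeros(len(index))
    for i in range(len(sequence) - n + 1):
        pos = index.get(sequence[i:i + n])
        if pos is not None:
            vec[pos] += 1
    return vec


def fit_svd(term_by_sequence, k):
    """Return the top-k left singular vectors of a (terms x sequences) matrix."""
    u, _, _ = np.linalg.svd(term_by_sequence, full_matrices=False)
    return u[:, :k]


def project(vec, basis):
    """Compress a long, sparse count vector to k dense coordinates."""
    return basis.T @ vec


# Toy training set (made-up sequences, for illustration only).
train = ["MKVLAAGIV", "MKVLSAGIV", "GGHWWPLLE"]
index = ngram_index(2)                                       # 400 bigrams
matrix = np.column_stack([encode(s, index, 2) for s in train])
basis = fit_svd(matrix, k=2)

# An unseen sequence is encoded the same way, then projected.
reduced = project(encode("MKVLAAGLV", index, 2), basis)
print(matrix.shape, reduced.shape)
```

In this sketch the reduced vector, rather than the 400-dimensional count vector, would be the input to the feed-forward network. The choice of k trades compression against retained information and would need to be tuned on real data.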