Towards a comprehensive collection of diagnostic patterns for protein sequence classification

  • Authors:
  • Björn Olsson;Kim Laurio

  • Affiliations:
  • Department of Computer Science, University of Skövde, Box 408, 541 28 Skövde, Sweden;Department of Computer Science, University of Skövde, Box 408, 541 28 Skövde, Sweden

  • Venue:
  • Information Sciences—Informatics and Computer Science: An International Journal
  • Year:
  • 2002

Abstract

The PROSITE collection of patterns for family classification of protein sequences requires much manual labour for motif finding and pattern updating, and yet has only moderate classification accuracy. Of the 1026 families with patterns in PROSITE release 16.0, only 523 (51%) had a diagnostic pattern, i.e., a pattern which discriminates perfectly between family and non-family sequences in the training set. There is therefore a need for reliable methods for automating motif finding and pattern construction, so that improved speed can be combined with greater classification accuracy.

In this paper we present our approach to automating the construction of a collection of patterns, and we announce release 1.0 of the pattern collection built by motif-finding by analysis of multiple alignments (MAMA). MAMA is found to improve the classification accuracy over PROSITE by finding many more diagnostic patterns. On 926 tested families, MAMA finds diagnostic patterns for 771 (83%). Furthermore, both the average specificity and the average sensitivity of MAMA patterns are found to be higher than for PROSITE.

A WWW interface that allows users to submit sequences and scan for matches in the MAMA pattern collection is available, together with a listing of all the patterns in MAMA release 1.0.
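
To make the notion of a diagnostic pattern concrete, the following is a minimal sketch, not the MAMA implementation. It assumes a simplified subset of PROSITE-style pattern syntax (`-` separators, `x` wildcards, `[..]` and `{..}` residue sets, `(n)` and `(n,m)` repeats, `<` and `>` anchors) and the conventional definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP); the paper may define these measures differently. The function names, the patterns, and the toy sequences are hypothetical and chosen only for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Assumption: only a subset of the syntax is handled (see lead-in).
    """
    pattern = pattern.rstrip(".")
    prefix = "^" if pattern.startswith("<") else ""
    suffix = "$" if pattern.endswith(">") else ""
    pattern = pattern.lstrip("<").rstrip(">")

    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2,4)" into its core and repeat counts.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            piece = "."                      # any residue
        elif core.startswith("["):
            piece = core                     # one of the listed residues
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"  # none of the listed residues
        else:
            piece = re.escape(core)          # a literal residue
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        parts.append(piece)
    return prefix + "".join(parts) + suffix


def evaluate(pattern, family, non_family):
    """Score a pattern on labelled sequences (both lists must be non-empty)."""
    regex = re.compile(prosite_to_regex(pattern))
    tp = sum(1 for seq in family if regex.search(seq))
    fp = sum(1 for seq in non_family if regex.search(seq))
    fn = len(family) - tp
    tn = len(non_family) - fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Diagnostic: matches every family member and no non-member.
        "diagnostic": fn == 0 and fp == 0,
    }


# Hypothetical toy data, not taken from PROSITE or MAMA.
FAMILY = ["MKCAACSGLL", "AACTTCTAWW"]
NON_FAMILY = ["MKCAACSPLL", "GGGGAAAA"]

strict = evaluate("C-x(2)-C-[ST]-{P}", FAMILY, NON_FAMILY)
loose = evaluate("C-x(2)-C", FAMILY, NON_FAMILY)
print(strict)
print(loose)
```

On this toy data the stricter pattern separates the two sets perfectly and so counts as diagnostic, whereas the looser pattern also matches one non-family sequence and therefore does not.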