Discovery of Diagnostic Patterns from Protein Sequence Databases
PKDD '98 Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery
The PROSITE collection of patterns for family classification of protein sequences requires much manual labour for motif finding and pattern updating, and yet has only moderate classification accuracy. Out of 1026 families with patterns in PROSITE release 16.0, only 523 (51%) had a diagnostic pattern, i.e., a pattern which discriminates perfectly between family and non-family sequences in the training set. There is therefore a need for reliable methods that automate motif finding and pattern construction, so that improved speed can be combined with greater classification accuracy.

In this paper we present our approach to automating the construction of a collection of patterns, and we announce release 1.0 of the pattern collection built by motif-finding by analysis of multiple alignments (MAMA). MAMA is found to improve the classification accuracy over PROSITE by finding many more diagnostic patterns. On 926 tested families, MAMA finds such patterns for 771 (83%). Furthermore, both the average specificity and the average sensitivity of MAMA patterns are found to be higher than for PROSITE.

A WWW interface that allows users to submit sequences and scan for matches in the MAMA pattern collection is available, together with a listing of all the patterns in MAMA release 1.0.
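
The notion of a diagnostic pattern used in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not the MAMA method. It assumes a simplified subset of the PROSITE pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` as terminal anchors) and the standard definitions of sensitivity, TP/(TP+FN), and specificity, TN/(TN+FP), which may differ from those used in the paper. The function names, the toy pattern, and the toy sequences are all hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Assumption: only the basic syntax is handled (x, [..], {..}, (n),
    (n,m), < and >); anchors inside brackets are not supported.
    """
    regex = pattern.rstrip(".")
    # Forbidden-residue sets: {P} becomes [^P]. Must run before the
    # repeat counts are rewritten, since those reuse curly braces.
    regex = regex.replace("{", "[^").replace("}", "]")
    # Repeat counts: x(2,4) becomes x{2,4}.
    regex = regex.replace("(", "{").replace(")", "}")
    # Wildcard, element separator, and terminal anchors.
    regex = regex.replace("x", ".").replace("-", "")
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


def evaluate(pattern: str, family: list[str], non_family: list[str]) -> dict:
    """Score a pattern against labelled training sequences."""
    regex = re.compile(prosite_to_regex(pattern))
    tp = sum(1 for seq in family if regex.search(seq))      # family hits
    fp = sum(1 for seq in non_family if regex.search(seq))  # false hits
    fn = len(family) - tp
    return {
        "sensitivity": tp / len(family),
        "specificity": (len(non_family) - fp) / len(non_family),
        # Diagnostic: matches every family sequence and no other sequence.
        "diagnostic": fn == 0 and fp == 0,
    }


# Toy data, invented for illustration (not taken from PROSITE or MAMA).
family = ["AACGGHACTT", "MCLLKDCW"]
non_family = ["AACGGHPCTT", "MKLLVDAW"]

strict = evaluate("C-x(2)-[HK]-{P}-C", family, non_family)
loose = evaluate("C-x(2)-[HK]", family, non_family)
print(strict)
print(loose)
```

On this toy data the longer pattern is diagnostic, while the shorter one also matches a non-family sequence and so is not, even though its sensitivity is unchanged.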