Mega-classification: discovering motifs in massive datastreams

  • Authors:
  • Nomi L. Harris;Lawrence Hunter;David J. States

  • Affiliations:
  • National Library of Medicine, National Institutes of Health, Bethesda, MD;National Library of Medicine, National Institutes of Health, Bethesda, MD;National Library of Medicine, National Institutes of Health, Bethesda, MD

  • Venue:
  • AAAI'92 Proceedings of the tenth national conference on Artificial intelligence
  • Year:
  • 1992

Quantified Score

Hi-index 0.00

Visualization

Abstract

We report on the development and application of an efficient unsupervised learning procedure for the classification of an unsegmented datastream, given a set of probabilistic binary similarity judgments between regions in the stream. Our method is effective on very large databases, and tolerates the presence of noise in the similarity judgements and in the extents of similar regions. We applied this method to the problem of finding the sequence-level building blocks of proteins. After verifying the effectiveness of the clusterer by testing it on synthetic protein data with known evolutionary history, we applied the method to a large protein sequence database (a datastream of more than 107 elements) and found about 10,000 protein sequence classes. The motifs defined by these classes are of biological interest, and have the potential to supplement or replace the existing manual annotation of protein sequence databases.