Mega-classification: discovering motifs in massive datastreams

Authors:
Nomi L. Harris;Lawrence Hunter;David J. States
Affiliations:
National Library of Medicine, National Institutes of Health, Bethesda, MD;National Library of Medicine, National Institutes of Health, Bethesda, MD;National Library of Medicine, National Institutes of Health, Bethesda, MD
Venue:
AAAI'92 Proceedings of the tenth national conference on Artificial intelligence
Year:
1992

Abstract

We report on the development and application of an efficient unsupervised learning procedure for the classification of an unsegmented datastream, given a set of probabilistic binary similarity judgments between regions in the stream. Our method is effective on very large databases and tolerates noise both in the similarity judgments and in the extents of similar regions. We applied this method to the problem of finding the sequence-level building blocks of proteins. After verifying the effectiveness of the clusterer by testing it on synthetic protein data with known evolutionary history, we applied the method to a large protein sequence database (a datastream of more than 10^7 elements) and found about 10,000 protein sequence classes. The motifs defined by these classes are of biological interest, and have the potential to supplement or replace the existing manual annotation of protein sequence databases.
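
To make the problem setting concrete, here is a minimal sketch of clustering regions of a stream from pairwise similarity judgments. It is not the authors' procedure, which the abstract does not specify. It assumes each judgment is a pair of half-open intervals (start, end) with a probability, discards judgments below a threshold, and links the remaining pairs with a union-find structure. To tolerate noisy extents, it also merges intervals that overlap substantially. The names cluster_regions, min_prob, and min_overlap and their default values are hypothetical.

```python
from collections import defaultdict


def overlap_fraction(a, b):
    """Fraction of the shorter interval covered by the intersection of a and b."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    return max(inter, 0) / shorter if shorter > 0 else 0.0


class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster_regions(judgments, min_prob=0.9, min_overlap=0.5):
    """Group stream regions into classes.

    judgments: iterable of (region_a, region_b, probability), where each
    region is a half-open (start, end) interval in the datastream.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    uf = UnionFind()
    regions = set()
    for a, b, prob in judgments:
        if prob < min_prob:
            continue  # treat low-probability judgments as noise
        uf.union(a, b)
        regions.update((a, b))

    # Merge regions whose extents overlap substantially, so that slightly
    # different boundaries for the same region do not split a class.
    ordered = sorted(regions)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # sorted by start: no later region overlaps a
            if overlap_fraction(a, b) >= min_overlap:
                uf.union(a, b)

    groups = defaultdict(list)
    for region in ordered:
        groups[uf.find(region)].append(region)
    return sorted(sorted(group) for group in groups.values())


judgments = [
    ((0, 10), (100, 110), 0.95),
    ((2, 11), (200, 209), 0.97),   # (2, 11) is a noisy copy of (0, 10)
    ((50, 60), (300, 310), 0.99),
    ((400, 410), (500, 510), 0.20),  # below threshold, ignored
]
classes = cluster_regions(judgments)
```

Plain transitive closure like this can chain unrelated regions together through a single spurious judgment, so a practical method for a stream of 10^7 elements would need a stricter merging criterion and a more scalable overlap search than this sketch provides.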