Towards Automatic Clustering of Protein Sequences

Authors:
Jiong Yang;Wei Wang
Affiliations:
-;-
Venue:
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Year:
2002



Abstract

Analyzing protein sequence data has become increasingly important. Most previous work in this area has focused on building classification models. In this paper, we investigate the problem of automatically clustering unlabeled protein sequences. As a widely recognized technique in statistics and computer science, clustering has proven very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed directly on protein sequences is the lack of an effective similarity measure that can be computed efficiently. Therefore, we propose a novel model for protein sequence clustering by exploring significant statistical properties possessed by the sequences. The concept of imprecise probabilities is introduced into the original probabilistic suffix tree to monitor the convergence of the empirical measurement and to guide the clustering process. It has been demonstrated that the proposed method can successfully discover meaningful families without the need to learn models of different families from pre-labeled "training data".
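
As a rough illustration of the kind of statistic a probabilistic suffix tree captures, the sketch below estimates the probability of the next residue given a variable-length preceding context, then scores a sequence by its average log-likelihood under those estimates. It is not the authors' implementation and does not model imprecise probabilities. The function names, the `max_depth` and `min_count` parameters, the add-one smoothing, and the flat dictionary standing in for a tree are all assumptions made for this example.

```python
import math
from collections import defaultdict


def build_context_counts(sequences, max_depth=2):
    """Count how often each symbol follows each context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            # Contexts are the k symbols immediately preceding position i.
            for k in range(min(max_depth, i) + 1):
                counts[seq[i - k:i]][symbol] += 1
    return counts


def next_symbol_probability(counts, context, symbol,
                            alphabet_size=20, min_count=2):
    """Estimate P(symbol | context), backing off to shorter suffixes.

    A context observed fewer than min_count times is treated as unreliable
    (an assumed rule of thumb) and is shortened from the left.
    """
    while context and sum(counts.get(context, {}).values()) < min_count:
        context = context[1:]
    seen = counts.get(context, {})
    total = sum(seen.values())
    # Add-one smoothing so unseen symbols keep a nonzero probability.
    return (seen.get(symbol, 0) + 1) / (total + alphabet_size)


def average_log_likelihood(counts, seq, max_depth=2):
    """Average per-symbol log-probability of seq under the context model."""
    total = 0.0
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - max_depth):i]
        total += math.log(next_symbol_probability(counts, context, symbol))
    return total / len(seq)
```

A higher average log-likelihood suggests that a sequence resembles those the counts were built from. A clustering procedure could use such a score as a similarity measure between a sequence and a candidate cluster, although how the paper defines and thresholds its measure is not stated in the abstract.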