Analyzing protein sequence data has become increasingly important. Most previous work in this area has focused on building classification models. In this paper, we investigate the problem of automatically clustering unlabeled protein sequences. A widely recognized technique in statistics and computer science, clustering has proven very useful for detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being applied directly to protein sequences is the lack of an effective similarity measure that can be computed efficiently. We therefore propose a novel model for protein sequence clustering that exploits significant statistical properties of the sequences. The concept of imprecise probabilities is introduced into the original probabilistic suffix tree to monitor the convergence of the empirical measurements and to guide the clustering process. The proposed method has been demonstrated to successfully discover meaningful families without the need to learn models of different families from pre-labeled "training data".
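To make the idea of a statistically derived similarity measure more concrete, the following is a minimal Python sketch of a variable-order context model in the spirit of a probabilistic suffix tree. It is not the method of the paper: the class name `ContextModel`, the parameters `max_order` and `min_count`, the add-one smoothing, and the toy four-letter alphabet are all assumptions made for illustration, and the imprecise-probability extension described above is not modeled. A fixed count threshold stands in, very roughly, for deciding when an empirical estimate is reliable enough to use.

```python
import math
from collections import defaultdict


class ContextModel:
    """Toy variable-order context model (hypothetical, for illustration only)."""

    def __init__(self, alphabet, max_order=2, min_count=2):
        self.alphabet = list(alphabet)
        self.max_order = max_order
        # Assumption: a context is trusted only after it has been seen this often.
        self.min_count = min_count
        # counts[context][symbol] = number of times `symbol` followed `context`
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, seq):
        """Update the counts with every context of length 0..max_order in seq."""
        for i, symbol in enumerate(seq):
            for k in range(min(i, self.max_order) + 1):
                self.counts[seq[i - k:i]][symbol] += 1

    def prob(self, context, symbol):
        """P(symbol | longest sufficiently observed suffix of context)."""
        for k in range(min(len(context), self.max_order), -1, -1):
            ctx = context[len(context) - k:]
            seen = self.counts.get(ctx)
            total = sum(seen.values()) if seen else 0
            if total >= self.min_count or k == 0:
                count = seen.get(symbol, 0) if seen else 0
                # Add-one smoothing so unseen symbols keep nonzero probability.
                return (count + 1) / (total + len(self.alphabet))

    def avg_log_likelihood(self, seq):
        """Per-symbol log-likelihood of a non-empty sequence under the model."""
        total = sum(math.log(self.prob(seq[:i], s)) for i, s in enumerate(seq))
        return total / len(seq)


# Usage: a model built from one group of sequences scores a similar sequence
# higher than a dissimilar one.
model = ContextModel("ACDE", max_order=2)
for s in ["ACACACAC", "CACACACA"]:
    model.add(s)
similar = model.avg_log_likelihood("ACACAC")
dissimilar = model.avg_log_likelihood("DEDEDE")
print(similar > dissimilar)
```

In a clustering loop, a score of this kind could be compared against a threshold to decide whether an unlabeled sequence joins an existing cluster or seeds a new one; the threshold and the loop itself are left out here because the abstract does not specify them.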