BAG: a graph theoretic sequence clustering algorithm

Authors:
Sun Kim;Jason Lee
Affiliations:
School of Informatics, Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47408, USA; School of Informatics, Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47408, USA
Venue:
International Journal of Data Mining and Bioinformatics
Year:
2006



Abstract

In this paper, we first discuss issues that arise when biological sequences are clustered using graph properties; these issues motivated the design of our sequence clustering algorithm, BAG. BAG recursively utilises several graph properties (biconnectedness, articulation points, and p-quasi-completeness) together with domain knowledge specific to biological sequence clustering. To reduce cluster fragmentation, we have developed a new metric, called cluster utility, to guide cluster splitting. Clusters are then merged back under less stringent constraints. Experiments with the entire COG database and other sequence databases show that BAG can cluster a large number of sequences accurately while keeping the number of fragmented clusters low.
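
To make two of the graph properties named above concrete, the following Python sketch finds articulation points in a small sequence-similarity graph and splits the graph at them. It is an illustration under stated assumptions and not the BAG implementation: the toy edge list, the sequence names, and the helper functions `build_graph`, `articulation_points` and `components_without` are all hypothetical, and the abstract does not define p-quasi-completeness or cluster utility precisely enough to sketch them here. Nodes stand for sequences, and an edge means that two sequences were judged similar by some pairwise comparison.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) similarity pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def articulation_points(graph):
    """Return the nodes whose removal disconnects their component.

    Standard depth-first search with discovery times and low-links.
    Recursive, so only suitable for small illustrative graphs.
    """
    disc, low, points = {}, {}, set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Subtree under v cannot climb above u: u is a cut point.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is a cut point only if it has several DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return points


def components_without(graph, removed):
    """Connected components that remain after deleting the `removed` nodes."""
    seen = set(removed)
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(comp)
    return components


# Hypothetical toy data: two families that are linked only through "m",
# e.g. a sequence that is similar to members of both families.
edges = [("a1", "a2"), ("a2", "m"), ("a1", "m"),
         ("m", "b1"), ("b1", "b2"), ("m", "b2")]
graph = build_graph(edges)
cut_points = articulation_points(graph)
print(cut_points)                             # {'m'}
print(components_without(graph, cut_points))  # the two candidate families
```

In this toy graph each triangle is a biconnected component, and `m` is the single articulation point joining them. Removing it separates the two candidate families, which is the kind of split that a clustering method built on these properties might consider.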