SIMCOMP: a hybrid soft clustering of metagenome reads

Authors:
Shruthi Prabhakara; Raj Acharya
Affiliations:
Department of Computer Science and Engineering, Pennsylvania State University, State College, PA; Department of Computer Science and Engineering, Pennsylvania State University, State College, PA
Venue:
PRIB'10 Proceedings of the 5th IAPR international conference on Pattern recognition in bioinformatics
Year:
2010

Abstract

A major challenge facing metagenomics is the development of tools for characterizing the functional and taxonomic content of vast amounts of short metagenome reads. In this paper, we present SimComp, a two-pass semi-supervised algorithm for soft clustering of short metagenome reads that is a hybrid of comparative and composition-based methods. In the first pass, a comparative analysis of the metagenome reads using BLASTx extracts reference sequences from within the metagenome to form an initial set of seeded clusters. Reads that have a significant match to the database are clustered by their phylogenetic provenance. In the second pass, the remaining fraction of reads is characterized by species-specific, composition-based characteristics. SimComp groups the reads into overlapping clusters, each with its own read leader. We make no assumptions about the taxonomic distribution of the dataset. The overlap between clusters handles the challenges posed by the nature of metagenomic data. The resulting cluster leaders can be used as an accurate estimate of the phylogenetic composition of the metagenomic dataset. Our method condenses the dataset into a small number of clusters, while accurately assigning fragments as small as 100 base pairs.
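The idea of a composition-based second pass with overlapping, leader-based clusters can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the composition features, the distance measure, or any threshold, so the dinucleotide profile, the Euclidean distance, the value 0.6, and the names `kmer_profile` and `soft_assign` are all assumptions chosen for illustration. The seed sequences stand in for leaders that a first, BLASTx-based pass might produce.

```python
from collections import Counter
from itertools import product
import math

K = 2  # assumed k-mer length; the abstract does not state the features used


def kmer_profile(seq, k=K):
    """Return the normalized k-mer frequency vector of seq over ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def distance(p, q):
    """Euclidean distance between two profiles (an assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def soft_assign(reads, seeds, threshold=0.6):
    """Assign each read to every leader within `threshold`.

    reads: dict of read id -> sequence (reads left over after pass one)
    seeds: dict of leader id -> sequence (hypothetical pass-one leaders)
    A read close to several leaders joins several clusters (soft
    clustering); a read close to none becomes a new leader.
    """
    leaders = {lid: kmer_profile(s) for lid, s in seeds.items()}
    clusters = {lid: set() for lid in leaders}
    for rid, seq in reads.items():
        prof = kmer_profile(seq)
        hits = [lid for lid, lp in leaders.items()
                if distance(prof, lp) <= threshold]
        if hits:
            for lid in hits:
                clusters[lid].add(rid)
        else:
            leaders[rid] = prof
            clusters[rid] = {rid}
    return clusters


# Toy data, invented for illustration only.
seeds = {"AT_seed": "ATATATATATAT", "GC_seed": "GCGCGCGCGCGC"}
reads = {
    "r1": "ATATATATAT",
    "r2": "GCGCGCGCGC",
    "r3": "ATATATGCGCGC",  # mixed composition, close to both seeds
    "r4": "AAAAAAAAAA",    # close to neither seed
}
clusters = soft_assign(reads, seeds, threshold=0.6)
```

In this toy run the mixed read falls into both seeded clusters, while the read that matches no leader starts a cluster of its own, which mirrors the overlapping, leader-based grouping described above. Real reads of around 100 base pairs would need a longer k-mer and a threshold tuned on data.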