A two-way multi-dimensional mixture model for clustering metagenomic sequences

Authors: Shruthi Prabhakara; Raj Acharya
Affiliations: Pennsylvania State University, University Park, PA (both authors)
Venue: Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine
Year: 2011

Abstract

Motivation: A major challenge in metagenomics is developing tools that characterize the functional and taxonomic content of vast numbers of short metagenome reads. The efficacy of clustering methods depends on the number of reads in the dataset, the read length, and the relative abundances of the source genomes in the microbial community.

Results: In this paper, we formulate an unsupervised naive Bayes multi-species, multi-dimensional mixture model for reads from a metagenome. We use the model to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. We model the distribution of word counts along a genome as a Gaussian for shorter, frequent words and as a Poisson for longer, rare words. We employ either a mixture of Gaussians or a mixture of Poissons to model the reads within each bin. These distributions are also flexible, and their parameters are easy to estimate. This paradigm characterizes the compositional heterogeneity of words along a genome, which constitutes its genome signature. Further, we handle the high dimensionality and sparsity of the data by grouping the set of words that make up the reads, which results in a two-way mixture model. Finally, we derive an unsupervised Expectation Maximization algorithm for the models. Our method provides a general statistical framework for modeling metagenome reads. We demonstrate its accuracy and applicability on simulated and real metagenomes. The method can accurately cluster reads as short as 100 bp and also estimate species abundance. It outperforms LikelyBin, another unsupervised composition-based binning method for metagenomes, on datasets of varying abundances, divergences, and read lengths.
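
To make the modeling idea concrete, the following is a minimal sketch of the Poisson case only: reads are turned into word-count vectors, and a naive Bayes mixture of independent Poissons is fitted with Expectation Maximization. It is not the authors' implementation. The function names `word_counts` and `poisson_mixture_em`, the farthest-point initialisation, and the smoothing constants are all assumptions made for illustration. The Gaussian variant for short words and the grouping of words that makes the model two-way are left out.

```python
import itertools
import numpy as np

def word_counts(read, k=2):
    """Count overlapping k-letter words over A/C/G/T in one read (hypothetical helper)."""
    words = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = np.zeros(len(words))
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip words containing ambiguous bases such as N
            counts[j] += 1
    return counts

def poisson_mixture_em(X, n_bins, n_iter=50, seed=0):
    """EM for a mixture of independent Poissons (illustrative sketch).

    X is an (n_reads, n_words) count matrix. Returns the mixing weights
    (fraction of reads per bin, a rough abundance proxy), the Poisson
    rates per bin and word, and the read-to-bin responsibilities.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumed initialisation: one random read, then repeatedly the read
    # farthest from the centers chosen so far.
    centers = [X[rng.integers(n)]]
    for _ in range(n_bins - 1):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    rates = np.array(centers) + 0.5  # smoothing keeps every rate positive
    weights = np.full(n_bins, 1.0 / n_bins)
    eps = 1e-3  # small constant guarding against empty bins
    for _ in range(n_iter):
        # E-step: log weight + Poisson log-likelihood, dropping the
        # log(x!) term because it is identical for every bin.
        log_post = np.log(weights) + X @ np.log(rates).T - rates.sum(axis=1)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted mean counts give the rates.
        mass = resp.sum(axis=0)
        rates = (resp.T @ X + eps) / (mass[:, None] + eps)
        weights = (mass + eps) / (mass + eps).sum()
    return weights, rates, resp

if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTTT", "GGGGCCCCGG", "GGCCGGCCGG"]
    X = np.array([word_counts(r, k=2) for r in reads])
    weights, rates, resp = poisson_mixture_em(X, n_bins=2)
    print("bin of each read:", resp.argmax(axis=1))
    print("estimated bin proportions:", weights)
```

In this sketch each read is assigned to the bin with the highest responsibility, and the mixing weights give the fraction of reads per bin. Converting that fraction into a species abundance would additionally require accounting for genome length, which is not attempted here.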