Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers

  • Authors:
  • Bin Yang; Yu Peng; Henry C.M. Leung; S. M. Yiu; J. C. Chen; Francis Y.L. Chin

  • Affiliations:
  • The University of Hong Kong, Hong Kong (all six authors)

  • Venue:
  • Proceedings of the third international workshop on Data and text mining in bioinformatics
  • Year:
  • 2009

Abstract

With the rapid development of genome sequencing techniques, traditional methods for studying microorganisms, which rely on isolation and cultivation, are gradually being replaced by metagenomics, also known as environmental genomics. The first step of a metagenomic analysis, which is still a major bottleneck, is the taxonomic characterization of the DNA fragments (reads) obtained by sequencing a sample of mixed species. This step is usually referred to as "binning". Existing binning methods are supervised or semi-supervised approaches that rely heavily on the reference genomes of known microorganisms and on phylogenetic marker genes. Because reference genomes are of limited availability and marker genes are biased and unstable, these methods may not be applicable in all cases. In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in reads). Our experiments show that our approach can accurately bin DNA fragments of various lengths and relative species abundance ratios without any reference or training datasets. Another highlight of our approach is its error robustness: the binning accuracy decreases by less than 1% as the sequencing error rate increases from 0% to 5%, a range well above the typical sequencing error rate of existing commercial sequencing platforms, which is less than 2%. The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/.
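As a rough illustration of what an "l-mer distribution" of a read looks like, the following minimal Python sketch counts every substring of length l in a read and normalizes the counts into relative frequencies. It is not the authors' method: the abstract does not describe how the set of l-mers is selected or how reads are clustered, so the function name `lmer_frequencies`, the default `l = 4`, and the choice to skip windows containing non-ACGT characters are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

DNA = "ACGT"


def lmer_frequencies(read: str, l: int = 4) -> dict[str, float]:
    """Return the relative frequency of every possible DNA l-mer in one read.

    Hypothetical helper for illustration only; windows that contain
    characters other than A, C, G, T (for example N) are skipped.
    """
    read = read.upper()
    # Slide a window of length l over the read and count valid l-mers.
    counts = Counter(
        read[i:i + l]
        for i in range(len(read) - l + 1)
        if set(read[i:i + l]) <= set(DNA)
    )
    total = sum(counts.values())
    # Emit all 4**l l-mers so every read maps to a vector of the same length.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(DNA, repeat=l)
    }


if __name__ == "__main__":
    freqs = lmer_frequencies("ACGTACGT", l=2)
    # Print only the l-mers that actually occur in the read.
    print({k: round(v, 3) for k, v in freqs.items() if v > 0})
```

Frequency vectors of this kind could then be passed to any standard clustering routine to group reads without reference genomes; which l-mers to keep and which clustering procedure to use are exactly the design choices the paper addresses, and they are not reproduced here.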