PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM Approach

  • Authors:
  • Xiao Zhu;Henry C.M. Leung;Francis Y.L. Chin;Siu Ming Yiu;Guangri Quan;Bo Liu;Yadong Wang

  • Affiliations:
  • School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China;Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong;Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong;Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong;National Pilot School of Software, Harbin Institute of Technology, WeiHai 264209, China;School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China;School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

  • Venue:
  • Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
  • Year:
  • 2013

Abstract

Because high-throughput sequencing (HTS) technologies produce short reads, de novo assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on either the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize the information in single-end and paired-end reads when resolving branches; for example, they do not take into account the number and positions of the reads supporting each possible extension. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds. Instead of using single-end reads to construct contigs, PERGA uses paired-end reads and different read overlap size thresholds, ranging from Omax to Omin, to resolve gaps and branches. Moreover, by constructing a decision model from branch features using a machine learning approach, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contigs by all feasible extensions and determines the correct one using a look-ahead technique. We evaluated PERGA on both simulated Illumina data sets and real data sets, and it constructed longer and more correct contigs and scaffolds than the other state-of-the-art assemblers IDBA-UD, Velvet, ABySS, SGA and CABOG. Availability: https://github.com/hitbio/PERGA
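
The idea of extending a contig greedily while relaxing the required read overlap from Omax down to Omin can be illustrated with a toy sketch. This is not PERGA's implementation: the function names, the exact-match overlap test, the single-base voting, and the rule of stopping at any branch are all assumptions made for illustration, and the sketch ignores paired-end information, the learned decision model, and look-ahead entirely.

```python
from collections import Counter


def next_base_votes(contig, reads, min_overlap):
    """Count the next bases proposed by reads that overlap the contig end.

    Hypothetical helper: a read votes when its first k bases (k >= min_overlap)
    exactly match the last k bases of the contig; its vote is the base at index k.
    """
    votes = Counter()
    for read in reads:
        for k in range(min_overlap, len(read)):
            if contig.endswith(read[:k]):
                votes[read[k]] += 1
    return votes


def extend_contig(seed, reads, o_max, o_min, max_len=1000):
    """Greedily extend `seed` one base at a time (illustrative only).

    At each step the overlap threshold starts at o_max and is relaxed toward
    o_min until some read supports an extension. If the supporting reads
    disagree (a branch), this sketch simply stops instead of resolving it.
    """
    contig = seed
    while len(contig) < max_len:  # guard against endless extension on repeats
        chosen = None
        for overlap in range(min(o_max, len(contig)), o_min - 1, -1):
            votes = next_base_votes(contig, reads, overlap)
            if not votes:
                continue  # no read overlaps this much: relax the threshold
            if len(votes) == 1:
                chosen = next(iter(votes))  # unambiguous extension
            break  # resolved, or a branch that this sketch leaves unresolved
        if chosen is None:
            break
        contig += chosen
    return contig


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTA"
    # Sparse toy reads (every second start position), so some steps need o < o_max.
    reads = [genome[i:i + 6] for i in range(0, len(genome) - 5, 2)]
    print(extend_contig("ACGTTG", reads, o_max=5, o_min=3))
```

In this toy setting, a fixed threshold equal to Omax would halt as soon as no read overlaps the contig end by that many bases, whereas allowing the threshold to fall toward Omin lets the extension continue across sparsely covered positions.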