Alignment seeding strategies using contiguous pyrimidine purine matches

  • Authors: Minmei Hou; Louxin Zhang; Robert S. Harris

  • Affiliations: Northern Illinois University, DeKalb, IL; National University of Singapore, Singapore; Penn State University, PA

  • Venue: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
  • Year: 2012

Abstract

Large-scale genomic pairwise aligners usually start with a seeding procedure, which scans two sequences to obtain base matches (called hits) that follow a certain pattern (called a seed). The seed pattern and size determine the sensitivity and specificity of the seeding procedure and greatly affect alignment accuracy and computational efficiency. Much effort has focused on obtaining an optimal spaced seed, or set of spaced seeds, to improve sensitivity. However, specificity also becomes a major concern when aligning very long genomic sequences. We present a seeding strategy that identifies contiguous pyrimidine purine (py·pu) matches. This model may improve sensitivity and specificity simultaneously compared to a contiguous base match model. We further present a seeding strategy that identifies contiguous py·pu matches containing at least a certain number of contiguous base matches. This model significantly improves sensitivity and specificity simultaneously compared to the base match model. When the ratio of transitions to transversions is high, it can also achieve better sensitivity than an optimal spaced seed without loss of specificity. Our examination of the CFTR region of 2M bases between human and mouse shows that this new model can have very high specificity without much loss of sensitivity compared to an optimal spaced seed. Based on the characteristics of alignments between human and other mammals (e.g., the sequence similarity, the ratio of transitions to transversions, and the lengths of gapless alignments), the new seeding strategies are promising for improving alignment quality across a wide selection of species pairs. This paper also lays the groundwork for future work on applying spaced patterns in these seeding strategies.
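
The two seeding strategies described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pypu_hits`, `longest_exact_run`), the parameters `k` and `min_exact`, and the use of a hash index over py·pu-encoded k-mers are all assumptions made for illustration. The sketch also assumes the sequences contain only A, C, G, and T. A window is a py·pu hit when every aligned pair of bases falls in the same class (purine A/G or pyrimidine C/T), so exact matches and transitions are both tolerated; setting `min_exact` above zero adds the second strategy's requirement of a run of contiguous exact base matches inside the window.

```python
from collections import defaultdict

# Collapse each base to its class: purines (A, G) -> R, pyrimidines (C, T) -> Y.
# Assumption: input contains only A, C, G, T (other symbols pass through unchanged).
PYPU = str.maketrans("AGCT", "RRYY")


def longest_exact_run(a, b):
    """Length of the longest run of identical characters at aligned positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def pypu_hits(seq1, seq2, k, min_exact=0):
    """Return (i, j) pairs where seq1[i:i+k] and seq2[j:j+k] form a py-pu hit.

    Hypothetical helper: a hit requires k contiguous py-pu matches and, if
    min_exact > 0, at least min_exact contiguous exact base matches.
    """
    seq1, seq2 = seq1.upper(), seq2.upper()
    enc1, enc2 = seq1.translate(PYPU), seq2.translate(PYPU)

    # Index every encoded k-mer of seq1 by its start position.
    index = defaultdict(list)
    for i in range(len(enc1) - k + 1):
        index[enc1[i:i + k]].append(i)

    hits = []
    for j in range(len(enc2) - k + 1):
        # .get avoids creating empty entries in the defaultdict.
        for i in index.get(enc2[j:j + k], ()):
            if longest_exact_run(seq1[i:i + k], seq2[j:j + k]) >= min_exact:
                hits.append((i, j))
    return hits
```

For example, `pypu_hits("ACGTAC", "GCGTAT", 6)` reports a hit at `(0, 0)` even though the two windows differ by two transitions, while raising `min_exact` to 5 rejects it because the longest run of exact matches is only 4. The filter is what trades a little sensitivity for specificity in this sketch.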