Read Annotation Pipeline for High-Throughput Sequencing Data

Authors:
James Holt;Shunping Huang;Leonard McMillan;Wei Wang
Affiliations:
Dept. of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA;Dept. of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA;Dept. of Computer Science, University of North Carolina, Chapel Hill, NC 27599, USA;Dept. of Computer Science, University of California, Los Angeles, CA 90095, USA
Venue:
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
Year:
2013

Abstract

Mapping reads to a reference sequence is a common step when analyzing allele effects in high-throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest that aligning to a single standard reference sequence, as is common practice, can introduce a bias that depends on the genetic distance of the target sequences from the reference. To avoid this bias, researchers have resorted to using modified reference sequences. Even with this improvement, several problems remain unsolved, including reduced mapping ratios, shifts in read mappings, and the choice of which variants to include to remove biases. To address these issues, we propose a novel and generic multi-alignment pipeline. Our pipeline integrates the genomic variations from known or suspected founders into separate reference sequences and performs alignments to each one. By mapping reads to multiple reference sequences and merging the alignments afterward, we are able to rescue more reads and diminish the bias caused by using a single common reference. Moreover, the genomic origin of each read is determined and annotated during the merging process, providing a better source of information for assessing differential expression than simple allele queries at known variant positions. Using RNA-seq data from a diallel cross, we compare our pipeline with the single-reference pipeline and demonstrate its advantages: more aligned reads and a higher percentage of reads with assigned origins.
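
The merge-and-annotate idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the merging rule, so the sketch assumes a hypothetical one (the alignment with the fewest mismatches wins, and ties are reported as multiple possible origins). It also assumes each alignment has already been converted to a common coordinate system. The names `Hit`, `merge_read`, and the founder labels `A` and `B` are invented for illustration.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    """One alignment of a read, assumed to be in common reference coordinates."""
    chrom: str
    pos: int
    mismatches: int


def merge_read(hits: Dict[str, Optional[Hit]]) -> Tuple[Optional[Hit], Tuple[str, ...]]:
    """Merge one read's per-founder alignments.

    Hypothetical rule (an assumption, not taken from the paper): keep the
    alignment with the fewest mismatches. Returns the chosen hit and the
    sorted founders that achieve that best score. An empty founder tuple
    means the read was unmapped against every reference.
    """
    mapped = {founder: hit for founder, hit in hits.items() if hit is not None}
    if not mapped:
        return None, ()
    best = min(hit.mismatches for hit in mapped.values())
    origins = tuple(sorted(f for f, hit in mapped.items() if hit.mismatches == best))
    return mapped[origins[0]], origins


# Toy input: read name -> alignment per founder reference (None = unmapped).
reads = {
    "r1": {"A": Hit("chr1", 100, 0), "B": Hit("chr1", 100, 2)},  # fits A better
    "r2": {"A": Hit("chr2", 500, 1), "B": Hit("chr2", 500, 1)},  # equally good
    "r3": {"A": None, "B": Hit("chr3", 42, 0)},                  # mapped only to B
    "r4": {"A": None, "B": None},                                # unmapped everywhere
}

merged = {name: merge_read(hits) for name, hits in reads.items()}
for name, (hit, origins) in merged.items():
    print(name, hit, origins)
```

In this toy input, read `r3` would be lost by a pipeline that aligns only to reference `A`, which is the kind of read the abstract describes as rescued. Read `r2` aligns equally well to both references, so its origin stays unassigned between `A` and `B`.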