Efficient computation of shortest absent words in a genomic sequence

Authors:
Zong-Da Wu;Tao Jiang;Wu-Jie Su
Affiliations:
Oujiang College, Wenzhou University, Wenzhou, Zhejiang, PR China;Computer College, Huazhong University of Science and Technology, Wuhan, PR China;Institute of Life Science, Jiangsu University, Zhenjiang, Jiangsu, PR China
Venue:
Information Processing Letters
Year:
2010



Abstract

Analyzing sequence composition is a basic task in genomic research. To compute the shortest absent words in a genomic sequence efficiently, we present a linear-time algorithm that first estimates the length of the shortest absent words using a probabilistic method and then, based on this estimate, finds all shortest absent words in the sequence. The algorithm needs to scan the genomic sequence only once and avoids the space requirements of index structures such as suffix trees and suffix arrays. Experimental results show that the algorithm takes only 1.5 minutes to compute the shortest absent words in the human genome, and is therefore more efficient than existing algorithms.
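
The idea of bounding the word length first and then marking observed words in a flat table, without a suffix tree or suffix array, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the probabilistic estimate, so the sketch substitutes a simple pigeonhole bound (a sequence of length n has at most n - k + 1 words of length k, so some word is absent once 4^k exceeds that count), and it rescans the sequence once per candidate length rather than once overall. The function names (`length_upper_bound`, `absent_words`, `shortest_absent_words`) and the choice to reset the window at any non-ACGT character are assumptions made for illustration.

```python
# Illustrative sketch only; names and the pigeonhole bound are assumptions,
# not taken from the paper.
ALPHABET = "ACGT"
CODE = {base: i for i, base in enumerate(ALPHABET)}


def length_upper_bound(n):
    """Smallest k with 4**k > n - k + 1, so some k-mer must be absent."""
    k = 1
    while 4 ** k <= n - k + 1:
        k += 1
    return k


def decode(index, k):
    """Turn a 2-bit-per-base integer back into a word of length k."""
    chars = []
    for _ in range(k):
        chars.append(ALPHABET[index & 3])
        index >>= 2
    return "".join(reversed(chars))


def absent_words(seq, k):
    """Scan seq once and return all words of length k that never occur."""
    seen = bytearray(4 ** k)   # one presence flag per possible k-mer
    mask = 4 ** k - 1          # keeps only the last k bases of the window
    value = 0                  # rolling 2-bit encoding of the window
    valid = 0                  # consecutive A/C/G/T characters seen so far
    for ch in seq.upper():
        code = CODE.get(ch)
        if code is None:       # e.g. 'N': no k-mer may span this position
            value = 0
            valid = 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            seen[value] = 1
    return [decode(i, k) for i, flag in enumerate(seen) if not flag]


def shortest_absent_words(seq):
    """Return (k, words) for the smallest k that has absent words."""
    for k in range(1, length_upper_bound(len(seq)) + 1):
        missing = absent_words(seq, k)
        if missing:
            return k, missing
    raise AssertionError("unreachable: the bound guarantees an absent word")
```

For example, `shortest_absent_words("ACGT")` reports length 2, because all four single bases occur but only three of the sixteen possible dimers do. The table needs 4^k bytes, which stays small as long as k is close to log_4 of the sequence length; a bit-packed table would reduce that further.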