Faster Algorithm for the Set Variant of the String Barcoding Problem

Authors:
Leszek Gąsieniec;Cindy Y. Li;Meng Zhang
Affiliations:
Department of Computer Science, University of Liverpool, Liverpool, UK;Histocompatibility and Immunogenetics Laboratory, National Blood Service, Bristol, UK;College of Computer Science and Technology, Jilin University, Changchun, China
Venue:
CPM '08 Proceedings of the 19th annual symposium on Combinatorial Pattern Matching
Year:
2008

Citing 6

String barcoding: uncovering optimal virus signatures

Proceedings of the sixth annual international conference on Computational biology
Rapid identification of repeated patterns in strings, trees and arrays

STOC '72 Proceedings of the fourth annual ACM symposium on Theory of computing
Optimal robust non-unique probe selection using Integer Linear Programming

Bioinformatics
DNA-BAR: distinguisher selection for DNA barcoding

Bioinformatics
Tight approximability results for test set problems in bioinformatics

Journal of Computer and System Sciences
Highly scalable algorithms for robust string barcoding

International Journal of Bioinformatics Research and Applications

Abstract

The string barcoding problem asks for a minimum set of substrings that distinguishes between all strings in a given set of strings ${\cal S}$. In a biological sense, the given strings represent a set of genomic sequences and the substrings serve as probes in a hybridisation experiment. In this paper, we study a variant of the string barcoding problem in which the substrings have to be chosen from a particular set of substrings of cardinality $n$. This variant can also be obtained from the more general test set problem (see, e.g., [1]) by fixing appropriate parameters. We present an almost optimal $O(n|{\cal S}|\log^3 n)$-time approximation algorithm for the considered problem. Our approximation procedure is a modification of the algorithm due to Berman et al. [1], which obtains the best possible approximation ratio $(1 + \ln n)$, provided that $NP\not\subseteq DTIME(n^{\log\log n})$. The improved time complexity is a direct consequence of more careful management of processed sets, the use of several specialised graph and string data structures, and a tighter time complexity analysis based on an amortised argument.
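
To make the set variant concrete, the following is a minimal sketch of a plain greedy heuristic: it repeatedly picks the candidate substring that separates the most still-undistinguished pairs of strings. It is not the algorithm of Berman et al. [1] nor the faster procedure presented in the paper, and it makes no claim about their approximation ratio or running time. The function names `greedy_barcode` and `barcodes` and the toy input are assumptions chosen for illustration only.

```python
from itertools import combinations


def greedy_barcode(strings, candidates):
    """Greedily choose candidates until every pair of strings is distinguished.

    A candidate distinguishes a pair of strings when it occurs in exactly
    one of the two. Illustrative heuristic only (hypothetical helper).
    """
    # For each candidate, record the indices of the strings that contain it.
    occurs = {
        c: frozenset(i for i, s in enumerate(strings) if c in s)
        for c in candidates
    }
    # All pairs of string indices start out undistinguished.
    undistinguished = set(combinations(range(len(strings)), 2))
    chosen = []

    def gain(c):
        # Number of remaining pairs that candidate c would separate.
        occ = occurs[c]
        return sum((i in occ) != (j in occ) for i, j in undistinguished)

    while undistinguished:
        best = max(candidates, key=gain)
        if gain(best) == 0:
            raise ValueError("candidates cannot distinguish all strings")
        chosen.append(best)
        occ = occurs[best]
        # Keep only the pairs that the new candidate fails to separate.
        undistinguished = {
            (i, j) for i, j in undistinguished if (i in occ) == (j in occ)
        }
    return chosen


def barcodes(strings, chosen):
    """Return, for each string, its presence/absence vector over `chosen`."""
    return [tuple(c in s for c in chosen) for s in strings]
```

For example, calling `greedy_barcode(["acgt", "acga", "ttga", "ccat"], ["ac", "ga", "tt", "cc", "t"])` selects a subset of the five candidates, and `barcodes` then assigns each of the four strings a distinct presence/absence vector. This naive version recomputes every gain in each round; the paper's contribution lies precisely in avoiding that kind of repeated work through more careful set management and specialised data structures.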