On Approximate Jumbled Pattern Matching in Strings

Authors:
Péter Burcsi; Ferdinando Cicalese; Gabriele Fici; Zsuzsanna Lipták
Affiliations:
Eötvös Loránd University, Department of Computer Algebra, Budapest, Hungary; University of Salerno, Dipartimento di Informatica ed Applicazioni, Salerno, Italy; I3S, UMR6070, CNRS and Université de Nice-Sophia Antipolis, France; Bielefeld University, AG Genominformatik, Technische Fakultät, Bielefeld, Germany
Venue:
Theory of Computing Systems - Special Issue: Fun with Algorithms
Year:
2012

Abstract

Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of a Parikh vector q in the text s means finding a substring t of s with p(t)=q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: given a string s and two Parikh vectors u,v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u≤q≤v, where the comparison is componentwise. This definition encompasses several natural versions of approximate Parikh vector search. We present an algorithm that solves this problem in sublinear expected time using a wavelet tree of s, which can be computed in time O(n) in a preprocessing phase. We then discuss a Scrabble-like variation of the problem, in which a weight function on the letters of s is given and the task is to find all occurrences in s of a substring t of maximum weight whose Parikh vector satisfies p(t)≤v. For the case of a binary alphabet, we present an algorithm that solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, after indexing the string in subquadratic time.
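The definitions in the abstract can be illustrated with a minimal Python sketch. It is not the paper's wavelet-tree algorithm or its subquadratic index. The first function is a plain two-pointer scan, and it assumes that an occurrence is "maximal" when it cannot be extended by one character on either side without exceeding v. The second pair of functions shows the well-known min/max-ones index for the exact binary decision problem, built here in quadratic time for simplicity. All function names (`approx_jumbled_matches`, `binary_index`, `binary_occurs`) are hypothetical.

```python
from collections import Counter


def approx_jumbled_matches(s, u, v):
    """Return half-open spans (start, end) of maximal windows t of s
    with u <= p(t) <= v componentwise (naive two-pointer sketch)."""
    u, v = Counter(u), Counter(v)
    counts = Counter()          # Parikh vector of the current window
    matches = []
    end, prev_end = 0, -1
    for start in range(len(s)):
        if end < start:         # previous window was empty; restart here
            end = start
        # Extend to the right while the upper bound v still holds.
        while end < len(s) and counts[s[end]] + 1 <= v[s[end]]:
            counts[s[end]] += 1
            end += 1
        if end > start:
            # The window is left-maximal iff the previous start could
            # not reach this far (ends are non-decreasing in start).
            left_maximal = end > prev_end
            if left_maximal and all(counts[c] >= u[c] for c in u):
                matches.append((start, end))
            counts[s[start]] -= 1
        prev_end = end
    return matches


def binary_index(s, one="1"):
    """For every window length m, store the min and max number of ones.
    Quadratic-time construction, chosen for clarity only."""
    n = len(s)
    prefix = [0]
    for ch in s:
        prefix.append(prefix[-1] + (ch == one))
    lo, hi = [0] * (n + 1), [0] * (n + 1)
    for m in range(1, n + 1):
        ones = [prefix[i + m] - prefix[i] for i in range(n - m + 1)]
        lo[m], hi[m] = min(ones), max(ones)
    return lo, hi


def binary_occurs(index, zeros, ones):
    """Constant-time exact decision query: does some substring contain
    exactly `zeros` zeros and `ones` ones?"""
    lo, hi = index
    m = zeros + ones
    return 0 < m < len(lo) and lo[m] <= ones <= hi[m]


print(approx_jumbled_matches("abaab", {"a": 1, "b": 1}, {"a": 2, "b": 1}))
idx = binary_index("0110")
print(binary_occurs(idx, 1, 1), binary_occurs(idx, 2, 0))
```

The binary query relies on the fact that sliding a fixed-length window by one position changes the number of ones by at most one, so every count between the minimum and the maximum is attained by some window.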