Given a string s, the Parikh vector of s, denoted p(s), counts the multi- plicity of each character in s. Searching for a match of a Parikh vector q in the text s requires finding a substring t of s with p(t) = q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: Given a string s and two Parikh vectors u,v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u ≤ q ≤ v. This defini- tion encompasses several natural versions of approximate Parikh vector search. We present an algorithm solving this problem in sub-linear expected time using a wavelet tree of s, which can be computed in time O(n) in a preprocessing phase. We then dis- cuss a Scrabble-like variation of the problem, in which a weight function on the letters of s is given and one has to find all occurrences in s of a substring t with maximum weight having Parikh vector p(t) ≤ v. For the case of a binary alphabet, we present an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time.

### On Approximate Jumbled Pattern Matching in Strings.

#### Abstract

Given a string s, the Parikh vector of s, denoted p(s), counts the multi- plicity of each character in s. Searching for a match of a Parikh vector q in the text s requires finding a substring t of s with p(t) = q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: Given a string s and two Parikh vectors u,v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u ≤ q ≤ v. This defini- tion encompasses several natural versions of approximate Parikh vector search. We present an algorithm solving this problem in sub-linear expected time using a wavelet tree of s, which can be computed in time O(n) in a preprocessing phase. We then dis- cuss a Scrabble-like variation of the problem, in which a weight function on the letters of s is given and one has to find all occurrences in s of a substring t with maximum weight having Parikh vector p(t) ≤ v. For the case of a binary alphabet, we present an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time.
##### Scheda breve Scheda completa Scheda completa (DC)
2012
String algorithms, Pattern matching, Parikh vectors, Average case analysis, Approximate search, Permuted strings
File in questo prodotto:
File
TOCS2012.pdf

solo utenti autorizzati

Tipologia: Versione dell'editore
Utilizza questo identificativo per citare o creare un link a questo documento: `https://hdl.handle.net/11562/390644`