The Parikh vector p(s) of a string s over a finite ordered alphabet Σ = {a1,...,aσ} is defined as the vector of multiplicities of the characters, p(s) = (p1,...,pσ), where pi =|{j|sj =ai}|.Parikhvectorqoccursinsifshasasubstringtwithp(t)=q. The problem of searching for a query q in a text s of length n can be solved simply and worst-case optimally with a sliding window approach in O(n) time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time.The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in O(1) time. The preprocessing can be done trivially in Θ(n2) time.The second algorithm finds all occurrences of a given Parikh vector in a text over anarbitrary alphabet of size σ ≥ 2 and has sub-linear expected time complexity. More pre-cisely, we present two variants of the algorithm, both using an O(n) size data structure,each of which can be constructed in O(n) time. The first solution is very simple and σ 1/2 log measy to implement and leads to an expected query time of O(n( log σ ) √m ), where m = i qi is the length of a string with Parikh vector q. The second uses wavelet treesσ 1/2 1 and improves the expected runtime to O(n( log σ ) √m ), i.e., by a factor of log m.
Algorithms for Jumbled Pattern Matching in Strings
Cicalese, Ferdinando;Liptak, Zsuzsanna
2012-01-01
Abstract
The Parikh vector p(s) of a string s over a finite ordered alphabet Σ = {a1,...,aσ} is defined as the vector of multiplicities of the characters, p(s) = (p1,...,pσ), where pi =|{j|sj =ai}|.Parikhvectorqoccursinsifshasasubstringtwithp(t)=q. The problem of searching for a query q in a text s of length n can be solved simply and worst-case optimally with a sliding window approach in O(n) time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time.The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in O(1) time. The preprocessing can be done trivially in Θ(n2) time.The second algorithm finds all occurrences of a given Parikh vector in a text over anarbitrary alphabet of size σ ≥ 2 and has sub-linear expected time complexity. More pre-cisely, we present two variants of the algorithm, both using an O(n) size data structure,each of which can be constructed in O(n) time. The first solution is very simple and σ 1/2 log measy to implement and leads to an expected query time of O(n( log σ ) √m ), where m = i qi is the length of a string with Parikh vector q. The second uses wavelet treesσ 1/2 1 and improves the expected runtime to O(n( log σ ) √m ), i.e., by a factor of log m.File | Dimensione | Formato | |
---|---|---|---|
IJFCS2012.pdf
solo utenti autorizzati
Tipologia:
Versione dell'editore
Licenza:
Copyright dell'editore
Dimensione
781 kB
Formato
Adobe PDF
|
781 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.