Given a string 𝑋 = 𝑋[1..𝑛] of length 𝑛, and integers 𝑚 and 𝑑, such that 𝑛 > 𝑚 ≥ 2𝑑 > 0, we consider the problem of compressing the string 𝑆 formed by concatenating the substrings of 𝑋 of length 𝑚 starting at positions 𝑖 ≡ 1 (mod 𝑑). In particular, we provide an upper bound of (2𝑛 − 𝑚)/𝑑 + 2𝑧 + (𝑚 − 𝑑) on the size of the Lempel-Ziv (LZ77) parsing of 𝑆, where 𝑧 is the size of the parsing of 𝑋. We also show that a related bound holds regardless of the order in which the substrings are concatenated in the formation of 𝑆. If 𝑋 is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short read DNA sequencing.
On Compressing Collections of Substring Samples
Sara Giuliani;Zsuzsanna Liptak;
2022-01-01
Abstract
Given a string 𝑋 = 𝑋[1..𝑛] of length 𝑛, and integers 𝑚 and 𝑑, such that 𝑛 > 𝑚 ≥ 2𝑑 > 0, we consider the problem of compressing the string 𝑆 formed by concatenating the substrings of 𝑋 of length 𝑚 starting at positions 𝑖 ≡ 1 (mod 𝑑). In particular, we provide an upper bound of (2𝑛 − 𝑚)/𝑑 + 2𝑧 + (𝑚 − 𝑑) on the size of the Lempel-Ziv (LZ77) parsing of 𝑆, where 𝑧 is the size of the parsing of 𝑋. We also show that a related bound holds regardless of the order in which the substrings are concatenated in the formation of 𝑆. If 𝑋 is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short read DNA sequencing.File | Dimensione | Formato | |
---|---|---|---|
ICTCS2022_1532.pdf
accesso aperto
Tipologia:
Versione dell'editore
Licenza:
Copyright dell'editore
Dimensione
1.5 MB
Formato
Adobe PDF
|
1.5 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.