Given a string 𝑋 = 𝑋[1..𝑛] of length 𝑛, and integers 𝑚 and 𝑑, such that 𝑛 > 𝑚 ≥ 2𝑑 > 0, we consider the problem of compressing the string 𝑆 formed by concatenating the substrings of 𝑋 of length 𝑚 starting at positions 𝑖 ≡ 1 (mod 𝑑). In particular, we provide an upper bound of (2𝑛 − 𝑚)/𝑑 + 2𝑧 + (𝑚 − 𝑑) on the size of the Lempel-Ziv (LZ77) parsing of 𝑆, where 𝑧 is the size of the parsing of 𝑋. We also show that a related bound holds regardless of the order in which the substrings are concatenated in the formation of 𝑆. If 𝑋 is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short read DNA sequencing.

On Compressing Collections of Substring Samples

Sara Giuliani;Zsuzsanna Liptak;
2022-01-01

Abstract

Given a string 𝑋 = 𝑋[1..𝑛] of length 𝑛, and integers 𝑚 and 𝑑, such that 𝑛 > 𝑚 ≥ 2𝑑 > 0, we consider the problem of compressing the string 𝑆 formed by concatenating the substrings of 𝑋 of length 𝑚 starting at positions 𝑖 ≡ 1 (mod 𝑑). In particular, we provide an upper bound of (2𝑛 − 𝑚)/𝑑 + 2𝑧 + (𝑚 − 𝑑) on the size of the Lempel-Ziv (LZ77) parsing of 𝑆, where 𝑧 is the size of the parsing of 𝑋. We also show that a related bound holds regardless of the order in which the substrings are concatenated in the formation of 𝑆. If 𝑋 is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short read DNA sequencing.
2022
Lempel-Ziv, LZ77, data compression, strings, repetitiveness, combinatorics on words, parsing, short reads
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1091828
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact