The Burrows-Wheeler Transform (BWT) is a powerful transform widely used in string compression and string processing. It produces a reversible permutation of the characters of the input string, often allowing for easier compression of the string, while also enabling fast pattern matching. While the BWT is defined for every word, not every word is the BWT of some word. A characterization of BWT images was given in [Likhomanov and Shur, CSR, 2011], based on the standard permutation of the string. Often an end-of-file character dollar is added to mark the end of a string. Given a string 𝑤, it is an interesting combinatorial question where a $ can be inserted to make 𝑤 the BWT of some string 𝑣dollar. This question was answered in [Giuliani et al. Theor. Comput. Sci. 2021], where an efficient algorithm was presented for computing all such positions (called nice positions), and a characterization of nice positions was given, based on pseudo-cycles in the standard permutation of 𝑤. In this paper, we give a stronger characterization of nice positions: We show that these can be characterized using only essential pseudo-cycles, which constitute a small subset of all possible pseudo- cycles. We present an algorithm to compute all essential pseudo-cycles of a word 𝑤. In the second part of the paper, we study nice positions of fully clustered words: these are words 𝑤 whose number of runs equals the number of distinct characters occurring in 𝑤. Fully clustered words are of particular interest due to their extreme compressibility, and words whose BWT is fully clustered have been studied extensively. We are interested in the number of nice positions of fully clustered words, as well as in the number of fully clustered words with 𝑘 nice positions, for fixed 𝑘 and word length.
When a Dollar in a Fully Clustered Word Makes a BWT
Sara Giuliani;Zsuzsanna Liptak;Francesco Masillo
2022-01-01
Abstract
The Burrows-Wheeler Transform (BWT) is a powerful transform widely used in string compression and string processing. It produces a reversible permutation of the characters of the input string, often allowing for easier compression of the string, while also enabling fast pattern matching. While the BWT is defined for every word, not every word is the BWT of some word. A characterization of BWT images was given in [Likhomanov and Shur, CSR, 2011], based on the standard permutation of the string. Often an end-of-file character dollar is added to mark the end of a string. Given a string 𝑤, it is an interesting combinatorial question where a $ can be inserted to make 𝑤 the BWT of some string 𝑣dollar. This question was answered in [Giuliani et al. Theor. Comput. Sci. 2021], where an efficient algorithm was presented for computing all such positions (called nice positions), and a characterization of nice positions was given, based on pseudo-cycles in the standard permutation of 𝑤. In this paper, we give a stronger characterization of nice positions: We show that these can be characterized using only essential pseudo-cycles, which constitute a small subset of all possible pseudo- cycles. We present an algorithm to compute all essential pseudo-cycles of a word 𝑤. In the second part of the paper, we study nice positions of fully clustered words: these are words 𝑤 whose number of runs equals the number of distinct characters occurring in 𝑤. Fully clustered words are of particular interest due to their extreme compressibility, and words whose BWT is fully clustered have been studied extensively. We are interested in the number of nice positions of fully clustered words, as well as in the number of fully clustered words with 𝑘 nice positions, for fixed 𝑘 and word length.File | Dimensione | Formato | |
---|---|---|---|
ICTCS2022_4242.pdf
accesso aperto
Tipologia:
Versione dell'editore
Licenza:
Copyright dell'editore
Dimensione
1.27 MB
Formato
Adobe PDF
|
1.27 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.