It is known that the exact form of the Burrows-Wheeler Transform (BWT) of a string col- lection depends, in most implementations, on the input order of the strings in the collection. Reordering strings of an input collection affects the number of equal-letter runs r, arguably the most important parameter of BWT-based data structures, such as the FM-index or the r-index. Bentley, Gibney, and Thankachan [ESA 2020] introduced a linear-time algorithm for computing the permutation of the input collection which yields the minimum number of runs of the resulting BWT. In this paper, we present the first tool that guarantees a Burrows-Wheeler Transform with minimum number of runs (optBWT), by combining i) an algorithm that builds the BWT from a string collection (either SAIS-based [Boucher et al., SPIRE 2021] or BCR [Bauer et al., CPM 2011]); ii) the SAP array data structure introduced in [Cox et al., Bioinformatics, 2012]; and iii) the algorithm by Bentley et al. We present results both on real-life and simulated data, showing that the improvement achieved in terms of r with respect to the input order is significant and the overhead created by the computation of the optimal BWT negligible, making our tool competitive with other tools for BWT-computation in terms of running time and space usage. In particular, on real data the optBWT obtains up to 31 times fewer runs with only a 1.39× slowdown. Source code is available at https://github.com/davidecenzato/optimalBWT.git.

Computing the optimal BWT of very large string collections

Davide Cenzato;Zsuzsanna Liptak;
2023-01-01

Abstract

It is known that the exact form of the Burrows-Wheeler Transform (BWT) of a string col- lection depends, in most implementations, on the input order of the strings in the collection. Reordering strings of an input collection affects the number of equal-letter runs r, arguably the most important parameter of BWT-based data structures, such as the FM-index or the r-index. Bentley, Gibney, and Thankachan [ESA 2020] introduced a linear-time algorithm for computing the permutation of the input collection which yields the minimum number of runs of the resulting BWT. In this paper, we present the first tool that guarantees a Burrows-Wheeler Transform with minimum number of runs (optBWT), by combining i) an algorithm that builds the BWT from a string collection (either SAIS-based [Boucher et al., SPIRE 2021] or BCR [Bauer et al., CPM 2011]); ii) the SAP array data structure introduced in [Cox et al., Bioinformatics, 2012]; and iii) the algorithm by Bentley et al. We present results both on real-life and simulated data, showing that the improvement achieved in terms of r with respect to the input order is significant and the overhead created by the computation of the optimal BWT negligible, making our tool competitive with other tools for BWT-computation in terms of running time and space usage. In particular, on real data the optBWT obtains up to 31 times fewer runs with only a 1.39× slowdown. Source code is available at https://github.com/davidecenzato/optimalBWT.git.
2023
Burrows-Wheeler Transform, big data, string collections, compression, optimal BWT
File in questo prodotto:
File Dimensione Formato  
Computing_the_optimal_BWT_of_very_large_string_collections.pdf

solo utenti autorizzati

Tipologia: Versione dell'editore
Licenza: Copyright dell'editore
Dimensione 581.27 kB
Formato Adobe PDF
581.27 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1091846
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? 1
social impact