Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. However, since its introduction, the term has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6x on the second best method. The peak memory is at most 2x larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.

Computing the original eBWT faster, simpler, and with less memory

Davide Cenzato;Zsuzsanna Lipták;
2021-01-01

Abstract

Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. However, since its introduction, the term has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6x on the second best method. The peak memory is at most 2x larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
2021
978-3-030-86691-4
Burrows-Wheeler-Transform, String collections, SAIS, Big BWT, prefix-free parsing, extended BWT
File in questo prodotto:
File Dimensione Formato  
978-3-030-86692-1_11.pdf

solo utenti autorizzati

Tipologia: Versione dell'editore
Licenza: Copyright dell'editore
Dimensione 445.07 kB
Formato Adobe PDF
445.07 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1053261
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 12
  • ???jsp.display-item.citation.isi??? ND
social impact