The extended Burrows-Wheeler Transform (eBWT) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the BWT to a collection of strings. As opposed to other commonly used BWT-variants for string collections, the eBWT is independent of the input order of the strings in the collection, while preserving the full functionality and compressibility of the classic BWT. In our prior work [Boucher et al. Computing the original eBWT faster, simpler, and with less memory, SPIRE 2021], we presented a linear-time algorithm for constructing the eBWT. The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [Nong et al., IEEE Trans Comput 2011] with Prefix-Free Parsing [Boucher et al., Alg Mol Biol 2019; Kuhnle et. al., JCB 2020]. In this paper, we show how this construction can be extended for building the r-index of the eBWT, which we call extended r-index, an analogous data structure to the r-index of Gagie et al. [SODA 2018, JACM 2020], with the added value that circular pattern matching is also supported. Our data structure occupies O(r) words, where r is the number of runs of the eBWT of the input collection, and answers count and locate queries in time analogous to the original r-index. We also show how to efficiently support finding maximal exact matches (MEMs) using the extended r-index. We implemented the extended r-index and tested it on circular bacterial genomes and plasmids, comparing it to five state-of-the-art compressed text indexes, including the original r-index and several dictionary-based indexes. While our data structure maintains similar time and memory requirements for answering pattern matching queries as the original r-index, it is the only index in the literature that regards the strings as circular. So, it can naturally be used for both circular and linear input collections.

r-indexing the eBWT

Zsuzsanna Liptak;
2024-01-01

Abstract

The extended Burrows-Wheeler Transform (eBWT) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the BWT to a collection of strings. As opposed to other commonly used BWT-variants for string collections, the eBWT is independent of the input order of the strings in the collection, while preserving the full functionality and compressibility of the classic BWT. In our prior work [Boucher et al. Computing the original eBWT faster, simpler, and with less memory, SPIRE 2021], we presented a linear-time algorithm for constructing the eBWT. The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [Nong et al., IEEE Trans Comput 2011] with Prefix-Free Parsing [Boucher et al., Alg Mol Biol 2019; Kuhnle et. al., JCB 2020]. In this paper, we show how this construction can be extended for building the r-index of the eBWT, which we call extended r-index, an analogous data structure to the r-index of Gagie et al. [SODA 2018, JACM 2020], with the added value that circular pattern matching is also supported. Our data structure occupies O(r) words, where r is the number of runs of the eBWT of the input collection, and answers count and locate queries in time analogous to the original r-index. We also show how to efficiently support finding maximal exact matches (MEMs) using the extended r-index. We implemented the extended r-index and tested it on circular bacterial genomes and plasmids, comparing it to five state-of-the-art compressed text indexes, including the original r-index and several dictionary-based indexes. While our data structure maintains similar time and memory requirements for answering pattern matching queries as the original r-index, it is the only index in the literature that regards the strings as circular. So, it can naturally be used for both circular and linear input collections.
2024
Burrows–Wheeler Transform, extended BWT, Prefix-Free Parsing, r-index, MEMs, circular strings
File in questo prodotto:
File Dimensione Formato  
InformationComputationAccepted.pdf

solo utenti autorizzati

Tipologia: Documento in Pre-print
Licenza: Copyright dell'editore
Dimensione 6.55 MB
Formato Adobe PDF
6.55 MB Adobe PDF   Visualizza/Apri   Richiedi una copia
1-s2.0-S0890540124000208-main.pdf

solo utenti autorizzati

Tipologia: Versione dell'editore
Licenza: Copyright dell'editore
Dimensione 2.35 MB
Formato Adobe PDF
2.35 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1120692
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact