In the last few decades, the advent of nextgeneration sequencing technologies (NGS) has dramatically reduced the cost of DNA sequencing. This has made it possible to sequence many genomes in very little time, paving the way for projects which aim at the creation of large and repetitive collections of genomic sequences. The abundance of biological data is driving the development of new memoryefficient algorithms and data structures that can scale for large datasets, thus tackling the high computational burden related to processing these data. This trend has a strong impact on the text algorithms area. In this thesis, we will study the BurrowsWheeler Transform for processing, indexing, and compressing collections of strings. Data compression addresses the problem of encoding the input to reduce the space needed for storing it, while text indexing focuses on finding ways to efficiently process and extract information from the data. In bioinformatics, these two concepts have been frequently used together since they allow the design of data structures that can efficiently process biological data while keeping the input compressed. The BurrowsWheeler Transform (BWT) is a reversible transformation on strings introduced by Michael Burrows and David J. Wheeler in 1994 that plays a central role in this area. It is the key component of several compressed data structures for text processing, like the FMindex [Ferraggina and Manzini, SODA, 2000] or the rindex [Gagie et al., SODA, 2018], and some of the most important software in bioinformatics, such as the wellknown Bowtie [Langmead et al., Genome Biology, 2009] and BWA [Li and Durbin, Bioinformatics, 2010]. The BWT was originally defined for individual strings, so when the focus moved from single sequences to string collections, there was the need to extend this transform. Over the years, several different tools and algorithms for computing BWT of string collections were introduced. However, even if the transforms generated by these tools frequently differ from each other, the problem of characterizing the BWT variants was never addressed properly. In this thesis, we close this gap by presenting the first systematic study of the BWT of string collections. We identified five nonequivalent variants computed by the tools in current use and analyzed their properties to show how exactly they differ. We complete our theoretical analysis by comparing the five BWT variants on several reallife biological datasets. We show that not only the differences among the resulting transforms can be extensive, but they also lead to significant changes in the compressibility of the BWT of the underlying string collection. As a further complication, the BWT variants in use often depend on the input order of the sequences. This significantly impacts the number of runs r, which defines the size of BWTbased compressed data structures. In this thesis, we address the problem of reordering the input sequences by providing the first implementation of the algorithm of Bentley et al. [ESA 2020], which computes the order minimizing the number of runs of the BWT. This leads to the creation of the first tool for computing the optimal BWT, i.e., the BWT variant which guarantees the minimum number of runs. We show experimentally that the input order can dramatically affect the final result: on our reallife datasets, the optimal BWT had up to 31 times fewer runs than the BWT computed without reordering the input sequences. The extended BWT (eBWT) of Mantaci et al. [Theor. Comput. Sci. 2007] is one of the first BWT variants explicitly designed to process string collections. Even though this transform is mathematically sound and has useful properties, its construction has been a problem for more than a decade. In this thesis, we present two lineartime algorithms for computing the eBWT of large string collections. The first is an improvement of the Bijective BWT construction algorithm of Bannai et al. [CPM 2019], while the second uses the Prefixfree parsing (PFP) method [Boucher et al., Algorithms Mol. Biol., 2019] to specifically process large and repetitive genomic sequence collections. In the final part of the thesis, we conclude by studying, for the first time, how to index string collections using the eBWT. We present the extended rindex, an extension of the rindex to the eBWT, which maintains the same performance as the original rindex while inheriting the properties of the eBWT. We implemented this data structure using a variant of the PFP algorithm and tested it on reallife biological datasets containing circular bacterial genomes and plasmids. We show experimentally that our index has competitive query times compared to the rindex on different pattern lengths while supporting advanced pattern matching functionalities on circular sequences.
Processing and indexing large biological datasets using the BurrowsWheeler Transform of string collections
Davide Cenzato^{}
20230101
Abstract
In the last few decades, the advent of nextgeneration sequencing technologies (NGS) has dramatically reduced the cost of DNA sequencing. This has made it possible to sequence many genomes in very little time, paving the way for projects which aim at the creation of large and repetitive collections of genomic sequences. The abundance of biological data is driving the development of new memoryefficient algorithms and data structures that can scale for large datasets, thus tackling the high computational burden related to processing these data. This trend has a strong impact on the text algorithms area. In this thesis, we will study the BurrowsWheeler Transform for processing, indexing, and compressing collections of strings. Data compression addresses the problem of encoding the input to reduce the space needed for storing it, while text indexing focuses on finding ways to efficiently process and extract information from the data. In bioinformatics, these two concepts have been frequently used together since they allow the design of data structures that can efficiently process biological data while keeping the input compressed. The BurrowsWheeler Transform (BWT) is a reversible transformation on strings introduced by Michael Burrows and David J. Wheeler in 1994 that plays a central role in this area. It is the key component of several compressed data structures for text processing, like the FMindex [Ferraggina and Manzini, SODA, 2000] or the rindex [Gagie et al., SODA, 2018], and some of the most important software in bioinformatics, such as the wellknown Bowtie [Langmead et al., Genome Biology, 2009] and BWA [Li and Durbin, Bioinformatics, 2010]. The BWT was originally defined for individual strings, so when the focus moved from single sequences to string collections, there was the need to extend this transform. Over the years, several different tools and algorithms for computing BWT of string collections were introduced. However, even if the transforms generated by these tools frequently differ from each other, the problem of characterizing the BWT variants was never addressed properly. In this thesis, we close this gap by presenting the first systematic study of the BWT of string collections. We identified five nonequivalent variants computed by the tools in current use and analyzed their properties to show how exactly they differ. We complete our theoretical analysis by comparing the five BWT variants on several reallife biological datasets. We show that not only the differences among the resulting transforms can be extensive, but they also lead to significant changes in the compressibility of the BWT of the underlying string collection. As a further complication, the BWT variants in use often depend on the input order of the sequences. This significantly impacts the number of runs r, which defines the size of BWTbased compressed data structures. In this thesis, we address the problem of reordering the input sequences by providing the first implementation of the algorithm of Bentley et al. [ESA 2020], which computes the order minimizing the number of runs of the BWT. This leads to the creation of the first tool for computing the optimal BWT, i.e., the BWT variant which guarantees the minimum number of runs. We show experimentally that the input order can dramatically affect the final result: on our reallife datasets, the optimal BWT had up to 31 times fewer runs than the BWT computed without reordering the input sequences. The extended BWT (eBWT) of Mantaci et al. [Theor. Comput. Sci. 2007] is one of the first BWT variants explicitly designed to process string collections. Even though this transform is mathematically sound and has useful properties, its construction has been a problem for more than a decade. In this thesis, we present two lineartime algorithms for computing the eBWT of large string collections. The first is an improvement of the Bijective BWT construction algorithm of Bannai et al. [CPM 2019], while the second uses the Prefixfree parsing (PFP) method [Boucher et al., Algorithms Mol. Biol., 2019] to specifically process large and repetitive genomic sequence collections. In the final part of the thesis, we conclude by studying, for the first time, how to index string collections using the eBWT. We present the extended rindex, an extension of the rindex to the eBWT, which maintains the same performance as the original rindex while inheriting the properties of the eBWT. We implemented this data structure using a variant of the PFP algorithm and tested it on reallife biological datasets containing circular bacterial genomes and plasmids. We show experimentally that our index has competitive query times compared to the rindex on different pattern lengths while supporting advanced pattern matching functionalities on circular sequences.File  Dimensione  Formato  

Revision_PhD_thesis_Davide_Cenzato.pdf
accesso aperto
Tipologia:
Tesi di dottorato
Licenza:
Creative commons
Dimensione
6.36 MB
Formato
Adobe PDF

6.36 MB  Adobe PDF  Visualizza/Apri 
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.