De novo genome assembly, the computational process to reconstruct the genomic sequence from scratch stitching together overlapping reads, plays a key role in computational biology and, to date, it cannot be considered a solved problem. Many bioinformatics approaches are available to deal with different type of data generated by diverse technologies. Assemblies relying on short read data resulted to be highly fragmented, reconstructing short contigs interrupted in repetitive region; on the other side long-read based approaches still suffer of high sequencing error rate, worsening the final consensus quality. This thesis aimed to assess the impact of different assembly approaches on the reconstruction of a highly repetitive genome, identifying the strengths and limiting the weaknesses of such approaches through the integration of orthogonal data types. Moreover, a benchmarking study has been undertaken to improve the contiguity of this genome, describing the improvements obtained thanks to the integration of additional data layers. Assemblies performed using short reads confirmed the limitation in the reconstruction of long sequences for both the software adopted. The use of long reads allowed to improve the genome assembly contiguity, reconstructing also a greater number of gene models. Despite the enhancement of contiguity, base level accuracy of long reads-based assembly could still not reach higher levels. Therefore, short reads were integrated within the assembly process to limit the base level errors present in the reconstructed sequences up to 96%. To order and orient the assembled polished contigs into longer scaffolds, data derived from three different technologies (linked read, chromosome conformation capture and optical mapping) have been analysed. The best contiguity metrics were obtained using chromosome conformation data, which permit to obtain chromosome-scale scaffolds. To evaluate the obtained results, data derived from linked reads and optical mapping have been used to identify putative misassemblies in the scaffolds. Both the datasets allowed the identification of misassemblies, highlighting the importance of integrating data derived from orthogonal technologies in the de novo assembly process. 4 This work underlines the importance of adopting bioinformatics approaches able to deal with data type generated by different technologies. In this way, results could be more accurately validated for the reconstruction of assemblies that could be eventually considered reference genomes.

Bioinformatics approaches for hybrid de novo genome assembly

Luca Marcolungo
2020-01-01

Abstract

De novo genome assembly, the computational process to reconstruct the genomic sequence from scratch stitching together overlapping reads, plays a key role in computational biology and, to date, it cannot be considered a solved problem. Many bioinformatics approaches are available to deal with different type of data generated by diverse technologies. Assemblies relying on short read data resulted to be highly fragmented, reconstructing short contigs interrupted in repetitive region; on the other side long-read based approaches still suffer of high sequencing error rate, worsening the final consensus quality. This thesis aimed to assess the impact of different assembly approaches on the reconstruction of a highly repetitive genome, identifying the strengths and limiting the weaknesses of such approaches through the integration of orthogonal data types. Moreover, a benchmarking study has been undertaken to improve the contiguity of this genome, describing the improvements obtained thanks to the integration of additional data layers. Assemblies performed using short reads confirmed the limitation in the reconstruction of long sequences for both the software adopted. The use of long reads allowed to improve the genome assembly contiguity, reconstructing also a greater number of gene models. Despite the enhancement of contiguity, base level accuracy of long reads-based assembly could still not reach higher levels. Therefore, short reads were integrated within the assembly process to limit the base level errors present in the reconstructed sequences up to 96%. To order and orient the assembled polished contigs into longer scaffolds, data derived from three different technologies (linked read, chromosome conformation capture and optical mapping) have been analysed. The best contiguity metrics were obtained using chromosome conformation data, which permit to obtain chromosome-scale scaffolds. To evaluate the obtained results, data derived from linked reads and optical mapping have been used to identify putative misassemblies in the scaffolds. Both the datasets allowed the identification of misassemblies, highlighting the importance of integrating data derived from orthogonal technologies in the de novo assembly process. 4 This work underlines the importance of adopting bioinformatics approaches able to deal with data type generated by different technologies. In this way, results could be more accurately validated for the reconstruction of assemblies that could be eventually considered reference genomes.
2020
de novo assembly
Long reads
Genome assembly
File in questo prodotto:
File Dimensione Formato  
Luca_Marcolungo_PhD_Thesis.pdf

accesso aperto

Descrizione: PhD thesis of Luca Marcolungo
Tipologia: Tesi di dottorato
Licenza: Creative commons
Dimensione 1.28 MB
Formato Adobe PDF
1.28 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1021364
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact