
Fragment size profiling and sequencer quality evaluation in clinical NGS

Luca Bertoli
2026-01-01

Abstract

This thesis investigates two critical determinants of clinical next-generation sequencing (NGS) data quality: the distribution of DNA fragment lengths (insert size) and the base quality scores reported by state-of-the-art sequencing platforms. By analyzing the two jointly, it provides an integrated view of how experimental choices shape coverage and genotypability in clinical genomics, and it quantifies the relationship between platform-specific estimated quality scores and the empirical accuracy of the data.

First, the thesis addresses a gap in the insert-size literature, where fragment length has traditionally been summarized with global statistics such as the mean or median, without resolving how different insert-size ranges influence coverage, mappability, and genotypability. By systematically exploring the full spectrum of insert sizes observed in whole-exome sequencing (WES) and characterizing their performance across key analytical metrics, it shows that very short inserts reduce both mappability and effective coverage, whereas long inserts preserve good mappability and variant-calling performance at the cost of reduced coverage. Appropriate optimization of insert size can therefore improve both the coverage and the interpretability of clinical WES data. The analysis is implemented in a fully reproducible Snakemake workflow, enabling standardized, insert-size-resolved evaluation across experiments and cohorts.

Second, on the platform side, the thesis tackles a methodological gap in sequencer evaluation: most comparative studies rely on aggregate indicators (such as global Q30 rates and overall variant-detection metrics) and rarely assess systematically how well platform-specific quality scores are calibrated against true error rates, or how this calibration depends on sequencing cycle and fragment length. To address this, it introduces a unified framework based on a single PCR-free NA12878 library sequenced on multiple modern short-read platforms. Within this framework, sequencer-assigned and alignment-based quality scores are compared across sequencing cycles and insert-size intervals, providing a high-resolution, directly comparable assessment of data quality across technologies. The analysis reveals platform-specific discrepancies between reported and empirical accuracy and highlights cases where nominally similar quality scores correspond to different error profiles, while confirming that all evaluated platforms deliver a high proportion of genuinely high-quality bases suitable for clinical-grade analyses.

Together, these contributions clarify how insert size and sequencing quality jointly shape the reliability of clinical NGS data, and they provide practical guidance for optimizing both library preparation and sequencing in routine genomic medicine.
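The calibration question raised in the abstract can be illustrated with the standard Phred relation, Q = -10 log10(p), where p is the probability that a base call is wrong. The sketch below is not the thesis's method, which this page does not describe. It assumes that each aligned base has already been reduced to a pair of (reported quality, mismatch flag), that a mismatch against the reference is an acceptable proxy for a sequencing error (known variant sites would need to be excluded first), and that a small pseudocount is an acceptable way to handle bins with no mismatches. All function names are hypothetical.

```python
import math
from collections import defaultdict


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_quality(observations):
    """Estimate the empirical Phred score for each reported quality value.

    observations: iterable of (reported_q, is_mismatch) pairs, one per
    aligned base. This input layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_mismatch)

    result = {}
    for reported_q in sorted(totals):
        # Pseudocounts (+1 error, +2 observations) keep the estimate
        # finite when a bin contains no mismatches at all.
        p = (errors[reported_q] + 1) / (totals[reported_q] + 2)
        result[reported_q] = error_to_phred(p)
    return result
```

A bin whose empirical score falls below its reported score is overconfident, and one whose empirical score is higher is conservative. Grouping the observations by sequencing cycle or insert-size interval before calling `empirical_quality` would give the kind of stratified comparison the abstract describes.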
2026
Keywords: Insert size, Base quality scores, Whole-exome sequencing, Whole-genome sequencing, Mappability, Variant calling
Files in this item:
File: PhD_Thesis_Luca_Bertoli_firmata.pdf
Access: open access
Type: Doctoral thesis
License: Creative Commons
Size: 5.63 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1195687