Exome CNV calling and Constellation Mapped Read for clinical structural variant analysis
Matteo Orlandi
2026-01-01
Abstract
Despite the widespread use of whole exome sequencing (WES) and whole genome sequencing (WGS) in rare disease diagnostics, many patients remain without a molecular diagnosis. Routine pipelines are highly optimized for calling single nucleotide variants (SNVs) and small insertions or deletions (indels), whereas copy number variants (CNVs) and other structural variants (SVs) are often handled by separate assays or investigated only in selected cases. This thesis asks how far short-read sequencing can be pushed toward comprehensive variant detection, and how close current technologies can come to a single genome assay that jointly captures SNVs, indels, CNVs and complex structural variants.

The first part of the work focuses on CNVs derived from WES. Motivated by unstable and clinically unmanageable CNV call sets from our in-house Twist exomes, I developed a Snakemake-based pipeline that integrates three exome CNV callers (ExomeDepth, ClinCNV and gCNV) under harmonized preprocessing and configuration. Benchmarking on the CNVPANEL01 reference cohort with validated CNVs, together with application to a heterogeneous clinical exome cohort from Burlo Garofolo Hospital, showed that the three tools reach broadly similar sensitivity on curated events, especially for deletions, but differ markedly in segmentation behavior, call volume and robustness to baseline composition. ExomeDepth is highly sensitive but prone to call inflation driven by GC-dependent coverage artefacts and by the choice of normal reference panel. ClinCNV produces fewer, more contiguous events and is more tolerant of baseline heterogeneity. gCNV benefits from joint modelling of large cohorts but is computationally demanding. A consensus strategy that retains CNVs supported by at least two callers, followed by clinical annotation and frequency-based filtering, reduces the call burden to a size more compatible with diagnostic review while preserving the pathogenic and likely pathogenic CNVs. At the same time, the results highlight intrinsic limits of exome-based CNV calling, including dependence on panels of normals, discontinuous target design and incomplete coverage of regulatory regions.

The second part of the thesis evaluates Illumina Constellation Mapped Read technology as a proximity-aware whole genome assay that augments standard short-read sequencing with long-range information derived from spatially proximate clusters on the flow cell. In a cohort of six clinically characterized genomes with known structural rearrangements, including one case with matched TruSeq PCR-free WGS, I compared Constellation Mapped Read and standard WGS in terms of coverage, callable genome, small variant calling, phasing and structural variant detection. Constellation maintains the small variant performance of PCR-free WGS and additionally reconstructs long templates, which enable phasing into haplotype blocks that frequently extend to tens of megabases. SV calling yields larger and better-supported CNV and SV call sets than the matched TruSeq dataset, and the combination of read depth, junction-level calls and genome-wide colocation matrices supports detailed reconstructions of complex events, including multi-step duplications at SCN1A, balanced translocations and focal microdeletions.

Taken together, these results show that carefully engineered multi-caller pipelines can extract clinically useful CNV information from exome data, although fundamental constraints remain, and that Constellation Mapped Read is a promising path toward an integrated genome assay that preserves the strengths of PCR-free WGS while adding long-range phasing and enhanced structural resolution.
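The consensus step mentioned in the abstract (keep a CNV only when at least two of the three callers support it) can be illustrated with a minimal sketch. This is not the thesis pipeline: the `CnvCall` record, the function names, the example coordinates and the 50% reciprocal-overlap criterion used to decide that two calls describe the same event are all assumptions made for illustration, since the abstract does not say how calls from different tools are matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CnvCall:
    """Hypothetical record for one CNV call; not the pipeline's real schema."""
    caller: str  # e.g. "ExomeDepth", "ClinCNV", "gCNV"
    chrom: str
    start: int   # half-open interval [start, end)
    end: int
    kind: str    # "DEL" or "DUP"


def reciprocal_overlap(a: CnvCall, b: CnvCall) -> float:
    """Smaller of the two overlap fractions, or 0.0 if the calls cannot match."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def consensus_calls(calls, min_callers=2, min_overlap=0.5):
    """Keep calls supported by at least `min_callers` distinct callers.

    The 0.5 reciprocal-overlap threshold is an assumed matching rule.
    Returns (call, sorted list of supporting callers) pairs.
    """
    kept = []
    for call in calls:
        supporters = {call.caller}
        for other in calls:
            if (other.caller != call.caller
                    and reciprocal_overlap(call, other) >= min_overlap):
                supporters.add(other.caller)
        if len(supporters) >= min_callers:
            kept.append((call, sorted(supporters)))
    return kept


# Made-up example calls: one deletion seen by all three tools,
# one duplication reported by a single tool.
calls = [
    CnvCall("ExomeDepth", "chr2", 1000, 5000, "DEL"),
    CnvCall("ClinCNV", "chr2", 1200, 5200, "DEL"),
    CnvCall("gCNV", "chr2", 900, 4800, "DEL"),
    CnvCall("ExomeDepth", "chr7", 100, 300, "DUP"),
]
consensus = consensus_calls(calls)
```

In this sketch each supported call is kept as reported by its own caller, so the shared deletion appears three times with slightly different breakpoints; a real pipeline would probably merge such calls into one event before annotation and frequency-based filtering.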



