Enhanced genotypability for a more accurate variant calling in targeted resequencing
Barbara Iadarola
2020-01-01
Abstract
The analysis of Next-Generation Sequencing (NGS) data for the identification of DNA genetic variants presents several bioinformatics challenges. The main requirements of the analysis are accuracy and reproducibility of results, as their clinical interpretation may be influenced by many variables, from sample processing to the bioinformatics algorithms adopted. Targeted resequencing, whose aim is the enrichment of genomic regions to identify genetic variants possibly associated with clinical diseases, bases the quality of its data on the depth and uniformity of coverage, which are used to differentiate between true and false positive findings. Many variant callers have been developed to reach the best accuracy considering these metrics, but they cannot work in regions of the genome where short reads cannot align uniquely (uncallable regions). Misalignment of reads to the reference genome can arise when reads are too short to span repetitive regions of the genome, causing the software to assign a low quality score to the read pairs of the same fragment. A limitation of this process is that variant callers are not able to call variants in these regions unless the quality of one of the two read mates increases. Moreover, current metrics are not able to define these regions accurately and therefore fail to provide this information to the end user. For this reason, a more accurate metric is needed to clearly report the uncallable genomic regions, with the prospect of improving the data analysis so that they can be investigated. This work aimed to improve the callability (genotypability) of the target regions for a more accurate data analysis and to provide high-quality variant calling. Different experiments were conducted to demonstrate the relevance of genotypability for the evaluation of targeted resequencing performance.
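As a rough illustration of the idea, genotypability can be thought of as the share of target bases covered by enough confidently mapped reads to be called. The sketch below is not the thesis's implementation: the function name `genotypable_fraction`, the input layout (one list of read mapping qualities per target base) and the default thresholds `min_mapq` and `min_depth` are all assumptions chosen for the example.

```python
from typing import Iterable, Sequence


def genotypable_fraction(
    mapq_per_base: Iterable[Sequence[int]],
    min_mapq: int = 20,   # hypothetical mapping-quality cutoff
    min_depth: int = 10,  # hypothetical minimum number of confident reads
) -> float:
    """Return the fraction of target bases considered callable.

    Each element of `mapq_per_base` holds the mapping qualities of the
    reads overlapping one target base. A base counts as callable when at
    least `min_depth` of those reads have mapping quality >= `min_mapq`.
    """
    total_bases = 0
    callable_bases = 0
    for mapqs in mapq_per_base:
        total_bases += 1
        confident_reads = sum(1 for q in mapqs if q >= min_mapq)
        if confident_reads >= min_depth:
            callable_bases += 1
    if total_bases == 0:
        raise ValueError("the target contains no bases")
    return callable_bases / total_bases


# Toy target of four bases, using a deliberately low depth cutoff.
toy_target = [
    [60, 60, 60],  # well covered by uniquely mapped reads
    [0, 0, 0],     # covered, but only by ambiguously mapped reads
    [60, 60, 0],   # only two confident reads
    [60, 30, 20],  # three confident reads
]
fraction = genotypable_fraction(toy_target, min_mapq=20, min_depth=3)
```

The second toy base shows why raw depth alone can be misleading: it has the same coverage as the first base, yet none of its reads align confidently, so it would remain uncallable under these assumed thresholds.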
First, this metric showed that increasing the depth of sequencing to rescue variants is not necessary at depths where genotypability reaches saturation (70X). To improve this metric and to evaluate the accuracy and reproducibility of results across different enrichment technologies for whole-exome sequencing (WES) sample processing, genotypability was evaluated on four exome platforms using three different DNA fragment lengths (short: ~200 bp, medium: ~350 bp, long: ~500 bp). Results showed that mapping quality increased on all platforms when the fragment was extended, thereby increasing the distance between the read pairs. The genotypability of many genes, including several associated with a clinical phenotype, improved markedly. Moreover, longer libraries increased the uniformity of coverage for platforms that have not been completely optimized for short fragments, further improving their genotypability. Given the relevance of the gain in data quality, especially from extending the short fragments to the medium ones, a deeper investigation was performed to identify a potential fragment length threshold above which the improvement in genotypability was significant. On the enrichment platform producing the highest enrichment uniformity (Twist), fragments above 230 bp obtained a meaningful improvement in genotypability (almost 1%) and a high uniformity of coverage of the target. Interestingly, the extension of the DNA fragment showed a greater influence on genotypability than uniformity of coverage alone. By enhancing genotypability for a more accurate bioinformatics analysis of the target regions, this work enabled, at limited cost (less sequencing), the investigation of regions of the genome previously defined as uncallable by current NGS methodologies.

File | Size | Format | |
---|---|---|---|
PhD_Thesis_Barbara_Iadarola.pdf (open access). Description: doctoral thesis. Type: doctoral thesis. License: Creative Commons | 2.55 MB | Adobe PDF | |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.