Precise variant calling in the clinical settings

Lavezzari, Denise

Identifying high-quality variants in whole exome sequencing (WES) analysis can be complex because changes to the sample preparation and sequencing protocol can adversely affect the downstream bioinformatics analysis. Evaluating and correlating the quality parameters of each analysis stage could help to improve accuracy and precision in variant identification. Furthermore, once high-quality variants have been identified, consulting reference databases that report the clinical significance and frequency of variants allows for a more accurate diagnosis. During laboratory and bioinformatics analysis, many metrics can be calculated to evaluate the quality of the data being processed. These metrics are usually examined separately, and their history is lost over time. Moreover, manually comparing a new workflow with existing ones can be very time-consuming. In addition, for a meaningful diagnosis of rare variants, it is important to consider the frequency of each variant in the population from which the sample comes. For this reason, a database is needed that incorporates all quality metrics from the entire WES analysis over time and collects population-specific variants for accurate clinical variant identification. This thesis aims to optimise the evaluation of quality metrics and the classification of variants in the Italian population by creating a Structured Query Language (SQL) database directly linked to a website for more intuitive use. The thesis sets out the structure of the database and the configuration of the web page. Furthermore, over the course of the thesis, approximately 2,500 exomes were analysed and all quality control parameters derived from both laboratory and bioinformatics analyses were collected.
All the data obtained were uploaded to the database to verify the usefulness of the application in monitoring data quality trends over time and in identifying possible problems. Two examples are presented of problems identified by the application and subsequently solved by modifying the laboratory protocol. Moreover, the potential of the database to simplify comparisons between existing and new laboratory protocols by storing their quality control parameters is shown. All variants identified in the analysed samples were uploaded to create an accessible reference of genetic variation in Italians. The correct classification of the Italian variants is shown in relation to well-known databases that report only a broader view of the European population. This approach enables researchers to classify variants that are not observed in the most widely used databases (gnomAD Exomes, ExAC, 1KgPhase3). It also allows the identification of variants that are rare in the Italian population, and might represent a disease predisposition, but are generally classified as common. In addition, it is possible to recognise variants that are common and non-damaging in the Italian population but classified as rare in the European population. In conclusion, the reported results and examples show how the new application (an extended database with its own website) simplifies the identification of problems in clinical WES analysis. It also makes comparisons between laboratory protocols easier, allowing for more precise exome analysis aimed at identifying variants. Finally, the specific investigation of Italian variants could improve diagnostic accuracy in this population.