Exploring machine learning methods to detect associations between genetic factors and complex traits on large-scale data

Dagnogo, Dramane

Chronic Kidney Disease (CKD) is a significant global health issue, with a prevalence of 10% in the general population worldwide and 13.2% in the Veneto region (Northern Italy). CKD is a risk factor for various pathophysiological conditions and is associated with a considerable risk of death and substantial healthcare costs. Detecting biomarkers could be essential for early screening of at-risk individuals, which could reduce the socio-economic impact of CKD and improve patient care. However, Genome-Wide Association Studies (GWAS), the main method for biomarker discovery, face challenges such as limited sample size and imbalanced sample distribution, which reduce statistical power and increase the risk of false positives and false negatives. Machine learning (ML) methods can help address these challenges. In particular, combining a sampler with a machine learning model holds the promise of improving predictive performance on imbalanced datasets. We aim to implement an ML model that enhances signal detection in small and/or imbalanced genomic datasets, and to explore leading biomarkers of the CKD trait in the INCIPE cohort. We investigated 1,738 samples (1,525 controls vs. 213 cases) from the INCIPE cohort. We performed an association test after applying the standard quality control procedures of PLINK2. We then selected and normalized all SNPs that achieved nominal significance (p-value <= 0.05) for downstream analysis. Furthermore, we implemented a nested ensemble model composed of a random undersampler and a CatBoostClassifier (NCBC). We used the selected data to train and test the NCBC model, and important SNPs were identified using feature information gain and SHapley Additive exPlanations (SHAP) values. Subsequently, functional analysis of the affected genes was conducted using various databases.
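As an illustration of the sampling step only, the following minimal sketch draws class-balanced subsets by random undersampling, using the class sizes reported above. The abstract does not specify how the sampler and the CatBoostClassifier are nested, so the classifier is omitted here; the function name `random_undersample`, the seeds, and the number of subsets are assumptions made for illustration. In practice a library sampler (for example `RandomUnderSampler` from imbalanced-learn) would likely be used instead.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset (hypothetical helper).

    Every class is randomly down-sampled, without replacement,
    to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Class sizes from the abstract: 1,525 controls (0) and 213 cases (1).
labels = [0] * 1525 + [1] * 213

# One balanced subset per ensemble member (5 members is an arbitrary choice);
# each seed keeps all cases and a different random subset of controls.
subsets = [random_undersample(labels, seed=s) for s in range(5)]

counts = Counter(labels[i] for i in subsets[0])
print(counts[0], counts[1], len(subsets[0]))
```

Each subset could then be used to fit one base classifier, with predictions aggregated across members; that aggregation is one plausible reading of "nested ensemble", not a description of the author's implementation.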
The performance of the NCBC model was evaluated using simulated genotype data, and we performed a sequential Chi-square homogeneity test to assess the effect of important SNPs on the distribution of the simulated data. Feature importance analysis selected sixty-seven (67) variants as the main contributors to the NCBC model in determining disease status. Twenty-nine (29) of these variants are intergenic SNPs, and the remaining 38 markers are related to 33 genes. GWAS Catalog screening showed that ABCA4, CDH20, E2F3, and RREB1 have been reported to be involved in glomerular filtration rate (GFR), and KCNIP4 in C-reactive protein measurement. Moreover, ABCA13, ABCA4, and ABCC4 are associated with ATP Binding Cassette (ABC) transporter pathways (p-value <= 0.05), which play a role in calcium transport in diverse cell types. GTEx expression profiles highlight that ABCA4 is strongly expressed in kidney tissues (medulla, cortex), suggesting a potential role in kidney function for this gene, which has historically been considered retina-specific. More broadly, eighteen (18) of the 33 genes selected by NCBC are connected to well-known CKDGen consortium genes within the same STRING network, supporting their contribution to CKD-related disease. Using simulated data, the NCBC model has shown evidence of being a suitable alternative for detecting biomarkers in imbalanced genotyping data. It also implicated the ABC transporter pathways as key players in the INCIPE cohort, involving the ABCA13, ABCA4, and ABCC4 genes. The ABCA4 gene, historically described as specific to the retina, could be involved in kidney function.
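The Chi-square homogeneity test mentioned above can be sketched as follows for a single comparison. This is a minimal example under stated assumptions: the genotype counts are hypothetical and not taken from the INCIPE or simulated data, the helper `chi2_homogeneity` is illustrative, and the sequential procedure used in the study is not reproduced, since the abstract does not describe it.

```python
import math


def chi2_homogeneity(table):
    """Pearson chi-square test of homogeneity for an r x c table of counts.

    Returns (statistic, degrees_of_freedom). Hypothetical helper.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if both groups share the same distribution.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical genotype counts (AA, Aa, aa) in two simulated groups.
table = [
    [60, 30, 10],  # group A (illustrative)
    [40, 40, 20],  # group B (illustrative)
]

stat, dof = chi2_homogeneity(table)

# With 2 degrees of freedom, the chi-square survival function
# has the closed form exp(-x / 2), so no statistics library is needed.
p_value = math.exp(-stat / 2)
print(stat, dof, p_value)
```

For tables with other dimensions the closed form no longer applies, and a general chi-square survival function from a statistics library would be needed.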