Comparison of Multi-site Neuroimaging Data Harmonization Techniques for Machine Learning Applications
Bellani, Marcella;
2023-01-01
Abstract
Multi-site datasets have become widely accessible to the research community, and their use in machine learning (ML) analyses is valued because it increases statistical power and improves model generalization. Nonetheless, variability associated with data sources can confound ML models, so site effects should be removed beforehand in a data processing stage. In the present study, we explore multi-site harmonization from an ML analysis perspective. Using a multi-site neuroimaging dataset composed of healthy controls and subjects with bipolar disorder, we compared the efficacy of site harmonization based on linear regression and on the ComBat model, each either applied to the entire dataset or adapted to the cross-validation (CV) framework used to evaluate ML models. We then trained a support vector machine (SVM) model for diagnosis classification and analyzed the impact of the harmonization strategies on the model's performance. The diagnosis classification AUC-ROC was comparable across harmonization strategies. This evidence supports the effectiveness of CV-based ComBat in harmonizing multi-center data while avoiding information leakage into the test sets, and supports the use of this strategy in ML analyses.
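As an illustration of what adapting harmonization to the cross-validation framework could look like, the following is a minimal sketch and not the authors' implementation. It assumes the simplest regression-based correction, with site as the only regressor and no covariates, which reduces to aligning each site's feature means to the grand mean of the training fold. The function names (`fit_site_offsets`, `apply_site_offsets`, `harmonize_in_cv`) are hypothetical. ComBat additionally adjusts site-specific variance using empirical Bayes estimates, which this sketch does not attempt.

```python
import numpy as np


def fit_site_offsets(X, sites):
    """Estimate per-site mean offsets from the given (training) samples only.

    Assumption of this sketch: site is the only regressor, so the site effect
    is each site's mean minus the grand mean of the training samples.
    """
    grand_mean = X.mean(axis=0)
    return {s: X[sites == s].mean(axis=0) - grand_mean for s in np.unique(sites)}


def apply_site_offsets(X, sites, offsets):
    """Subtract the previously estimated offset of each sample's site.

    Samples from a site that is absent from `offsets` are left unchanged.
    """
    X_h = X.astype(float).copy()
    for s, off in offsets.items():
        X_h[sites == s] -= off
    return X_h


def harmonize_in_cv(X, sites, n_folds=5, seed=0):
    """Yield (train_idx, test_idx, X_train_h, X_test_h) for each fold.

    Offsets are estimated on the training fold and then applied to both the
    training and the test fold, so no test data enters the estimation.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        offsets = fit_site_offsets(X[train], sites[train])
        yield (
            train,
            test,
            apply_site_offsets(X[train], sites[train], offsets),
            apply_site_offsets(X[test], sites[test], offsets),
        )
```

In such a setup, a classifier would be trained on each harmonized training fold and scored on the corresponding harmonized test fold. Estimating the offsets on the entire dataset before splitting corresponds to the whole-dataset strategy, in which test samples contribute to the correction.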