Comparison of Multi-site Neuroimaging Data Harmonization Techniques for Machine Learning Applications

Sampaio, Inês W.; Tassi, Emma; Bellani, Marcella; Benedetti, Francesco; Poletti, Sara; Spalletta, Gianfranco; Piras, Fabrizio; Bianchi, Anna Maria; Brambilla, Paolo; Maggioni, Eleonora

2023-01-01

Abstract

Multi-site datasets have become widely accessible to the research community, and their use in machine learning (ML) analyses is valued because it increases statistical power and improves model generalization. Nonetheless, variability associated with the data sources can confound ML models, so site effects should be removed beforehand in a data processing stage. In the present study, we explore multi-site harmonization from an ML analysis perspective. Using a multi-site neuroimaging dataset composed of healthy controls and subjects with bipolar disorder, we compared the efficacy of site harmonization based on linear regression and on the ComBat model, each either applied to the entire dataset or adapted to the cross-validation (CV) framework used to evaluate the ML models. We then trained a support vector machine (SVM) model for diagnosis classification and analyzed the impact of the harmonization strategies on the model's performance. The AUC-ROC for diagnosis classification was comparable across harmonization strategies. This result indicates that CV-based ComBat effectively harmonizes multi-center data while avoiding information leakage into the test sets, supporting the use of this strategy in ML analyses.
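The difference between harmonizing the entire dataset and harmonizing within the cross-validation framework can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces ComBat with a much simpler location-only adjustment (each site's mean offset, estimated on the training fold alone), uses a single feature, and all function names are hypothetical. The point it shows is that the site parameters are fitted on the training subjects only and then applied unchanged to the held-out subjects, so no test-set information enters the harmonization step.

```python
from collections import defaultdict
from statistics import fmean


def fit_site_offsets(values, sites):
    """Estimate each site's additive offset from the given (training) data."""
    grand_mean = fmean(values)
    by_site = defaultdict(list)
    for value, site in zip(values, sites):
        by_site[site].append(value)
    # Offset = how far a site's mean lies from the mean of all training values.
    return {site: fmean(vs) - grand_mean for site, vs in by_site.items()}


def apply_site_offsets(values, sites, offsets):
    """Remove previously estimated offsets; nothing is re-estimated here."""
    return [value - offsets[site] for value, site in zip(values, sites)]


def harmonize_within_cv(values, sites, n_folds):
    """Yield (train_idx, test_idx, harmonized_train, harmonized_test) per fold.

    Folds are interleaved for brevity; a real analysis would typically
    shuffle and stratify by site and diagnosis (assumption, not from the source).
    """
    n = len(values)
    for k in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == k]
        train_idx = [i for i in range(n) if i % n_folds != k]
        train_values = [values[i] for i in train_idx]
        train_sites = [sites[i] for i in train_idx]
        # Site parameters come from the training fold only.
        offsets = fit_site_offsets(train_values, train_sites)
        harmonized_train = apply_site_offsets(train_values, train_sites, offsets)
        # The held-out subjects are corrected with the training-fold offsets.
        harmonized_test = apply_site_offsets(
            [values[i] for i in test_idx], [sites[i] for i in test_idx], offsets
        )
        yield train_idx, test_idx, harmonized_train, harmonized_test
```

In each fold, a classifier would be trained on `harmonized_train` and scored on `harmonized_test`. The sketch assumes every site in a test fold also appears in the corresponding training fold; otherwise the offset lookup fails. Harmonizing the entire dataset corresponds instead to calling `fit_site_offsets` once on all subjects before splitting, which lets test subjects influence the correction.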

Year: 2023
ISBN of the conference proceedings: 9781665463973
Keywords: ComBat; Confounders; Harmonization; Machine Learning; Multi-centric data
Publication type: 04.01 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1117828
