Multiple Sclerosis Classification via Random Forest Distances Robust to Missing Data

Mensi, Antonella; di Maria, Vincenzo; Barusolo, Elena; Magliozzi, Roberta; Bicego, Manuele
2025-01-01

Abstract

Random Forest (RF) distances are powerful data-dependent measures whose usefulness has been shown in many different contexts. Very recently, they have been extended to handle missing data, a crucial problem affecting many biomedical domains. However, their characterization has so far been mainly theoretical, supported only by simple experiments showing that the pairwise distances computed in the presence of missing data are nearly equivalent to those computed on complete data, without focusing on a specific task such as classification or clustering. In this paper, we take a fundamental step forward in their evaluation, showing their usefulness in a challenging real-world scenario: the classification of Multiple Sclerosis (MS) according to the levels of various cerebrospinal fluid (CSF) protein markers. We base our experiments on MissRatioRF, a state-of-the-art RF distance adapted for missing data. A thorough experimental evaluation on real data shows that this RF distance outperforms all other state-of-the-art distances for many degrees of missingness, even severe ones.
Year: 2025
ISBN: 9783032101914
Keywords: Missing data; Multiple Sclerosis; Random Forest Distances

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1193779