In recent years there has been an increased interest in clustering methods based on Random Forests, due to their flexibility and their capability in describing data. One problem of current RF-clustering approaches is that they are not able to directly deal with missing data, a common scenario in many application fields (e.g. Bioinformatics): the usual solution in this case is to pre-impute incomplete data before running standard clustering methods. In this paper we present the first Random Forest clustering approach able to directly deal with missing data. We start from the very recent RatioRF distance for clustering [3], which has shown to outperform all other distance-based RF clustering schemes, extending the framework in two directions, which allow the integration of missing data mechanisms directly inside the clustering pipeline. Experimental results, based on 6 standard UCI ML datasets, are promising, also in comparison with some literature alternatives.

Distance-Based Random Forest Clustering with~Missing Data

Bicego, M.;Cicalese, F.
2022-01-01

Abstract

In recent years there has been an increased interest in clustering methods based on Random Forests, due to their flexibility and their capability in describing data. One problem of current RF-clustering approaches is that they are not able to directly deal with missing data, a common scenario in many application fields (e.g. Bioinformatics): the usual solution in this case is to pre-impute incomplete data before running standard clustering methods. In this paper we present the first Random Forest clustering approach able to directly deal with missing data. We start from the very recent RatioRF distance for clustering [3], which has shown to outperform all other distance-based RF clustering schemes, extending the framework in two directions, which allow the integration of missing data mechanisms directly inside the clustering pipeline. Experimental results, based on 6 standard UCI ML datasets, are promising, also in comparison with some literature alternatives.
2022
978-3-031-06432-6
Missing data; Random Forest clustering; Ratio RF distance
File in questo prodotto:
File Dimensione Formato  
paper_v3.pdf

solo utenti autorizzati

Tipologia: Documento in Post-print
Licenza: Copyright dell'editore
Dimensione 329.93 kB
Formato Adobe PDF
329.93 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11562/1074150
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? 0
social impact