Counterintuitive Behavior of Clustering Quality: Findings for K-Means on Synthetic and Real Data

Loog, Marco;Bicego, Manuele
2025-01-01

Abstract

Little is known about how the quality of a clustering changes with the size of the dataset used to fit the clustering model. We show that, for K-means clustering, the relationship between dataset size and clustering quality can display counterintuitive behavior. Notably, the quality can deteriorate significantly when more data are used to build the model. More generally, using artificial datasets and data from bioinformatics, we uncover a variety of learning curve behaviors for K-means. Our results illustrate that the training sample size can have a nontrivial influence on clustering performance. Our findings should appeal to both clustering practitioners and clustering researchers concerned with developing basic insights.
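To make the notion of a learning curve for K-means concrete, the following is a minimal sketch and not the authors' experimental protocol. It assumes a hypothetical toy dataset (three 2-D Gaussian blobs), a plain NumPy implementation of Lloyd's algorithm, and held-out quantization error as the quality measure; the function names, the data, and the choice of measure are illustrative assumptions, and the paper may define quality differently.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns a (k, d) array of centroids."""
    # Initialise with k distinct training points (requires len(X) >= k).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids


def quantization_error(X, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).mean()


def sample_blobs(n, rng):
    """Hypothetical toy data: a mixture of three unit-variance 2-D Gaussians."""
    means = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
    idx = rng.integers(0, len(means), size=n)
    return means[idx] + rng.normal(size=(n, 2))


def learning_curve(sizes, k=3, n_repeats=20, seed=0):
    """Average held-out error of K-means as a function of training set size."""
    rng = np.random.default_rng(seed)
    X_test = sample_blobs(2000, rng)  # fixed evaluation set
    curve = {}
    for n in sizes:
        errors = [
            quantization_error(X_test, kmeans(sample_blobs(n, rng), k, rng))
            for _ in range(n_repeats)
        ]
        curve[n] = float(np.mean(errors))
    return curve


if __name__ == "__main__":
    for n, err in learning_curve([5, 10, 20, 50, 100, 500]).items():
        print(f"n={n:4d}  held-out error={err:.3f}")
```

Plotting the returned values against the training set size gives one empirical learning curve. Whether such a curve decreases monotonically depends on the data, the number of clusters, and the quality measure, which is the kind of dependence the abstract refers to.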
Year: 2025
ISBN: 9783031913976
Keywords: Clustering quality; K-means; Counterintuitive behavior; Monotonicity; Gene ontology enrichment analysis
Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1161709