Augmentation Techniques for Balancing Spatial Datasets in Machine and Deep Learning Applications
Belussi, Alberto; Migliorini, Sara
2024-01-01
Abstract
Thanks to the availability of huge amounts of spatial data, many new machine and deep learning (ML/DL) applications that can handle this kind of information have emerged. In particular, new cost models have been developed to predict the cost of spatial operations accurately. To obtain good ML/DL models, training is usually performed with synthetically generated datasets that capture as many spatial distributions and as many combinations of features (e.g., cardinality, geometry complexity) as desired, with the aim of improving the generalization capabilities of the trained models. However, when a model is used to estimate some property of a spatial operation, such as range query selectivity, balancing the characteristics of the input datasets may not be enough to guarantee balanced ground-truth values of the target variable. Therefore, we need a way to balance the final results without recomputing the operation from scratch. This paper formalizes the notion of dataset balancing in the context of spatial ML/DL, proposes a set of metrics for evaluating the degree of balancing of the input domains and the target values, and defines a set of augmentation techniques specifically tailored to spatial data. Finally, it tests the effects of such augmentations on the training of a generic ML cost model for estimating the selectivity of spatial range queries.

File | Size | Format |
---|---|---|
sigspatial_2024.pdf (open access; license: Creative Commons) | 3.07 MB | Adobe PDF |
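
The abstract's central observation, that balanced inputs do not necessarily yield balanced target values, can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the uniform point dataset, the square query generator, the helper names (`selectivity`, `histogram`, `balance_score`, `random_query`), and the use of normalized Shannon entropy as a stand-in balance score. The paper defines its own metrics, which are not reproduced here.

```python
import math
import random


def selectivity(points, query):
    """Fraction of points that fall inside an axis-aligned query box."""
    xmin, ymin, xmax, ymax = query
    hits = sum(1 for x, y in points if xmin <= x <= xmax and ymin <= y <= ymax)
    return hits / len(points)


def histogram(values, bins=10):
    """Count values in [0, 1] over equal-width bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


def balance_score(counts):
    """Hypothetical balance score: normalized Shannon entropy of the bin counts.

    1.0 means all bins are equally populated, 0.0 means a single bin holds everything.
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(len(counts))


def random_query(rng, side):
    """Square query of the given side, placed fully inside the unit square."""
    x = rng.uniform(0.0, 1.0 - side)
    y = rng.uniform(0.0, 1.0 - side)
    return (x, y, x + side, y + side)


rng = random.Random(42)

# Assumed synthetic dataset: points uniformly distributed in the unit square.
points = [(rng.random(), rng.random()) for _ in range(2000)]

# Query side lengths are drawn uniformly, so this input feature is balanced.
sides = [rng.uniform(0.01, 1.0) for _ in range(500)]

# Ground-truth target values: the selectivity of each range query.
targets = [selectivity(points, random_query(rng, s)) for s in sides]

print("input balance :", round(balance_score(histogram(sides)), 3))
print("target balance:", round(balance_score(histogram(targets)), 3))
```

Under these assumptions the target score should come out lower than the input score, because the selectivity of a square query over uniform data grows roughly with the square of its side, so evenly spread side lengths concentrate the targets at low selectivity values.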
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.