Using Deep Learning for Big Spatial Data Partitioning

Alberto Belussi; Sara Migliorini
2020-01-01

Abstract

This article explores the use of deep learning to choose an appropriate spatial partitioning technique for big data. The exponential increase in the volume of spatial datasets has resulted in the development of big spatial data frameworks. These systems need to partition the data across machines to be able to scale out the computation. Unfortunately, there is currently no method to automatically choose an appropriate partitioning technique based on the input data distribution. This article addresses this problem by using deep learning to train a model that captures the relationship between the data distribution and the quality of the partitioning techniques. We propose a solution that runs in two phases, training and application. The offline training phase generates synthetic data based on diverse distributions, partitions them using six different partitioning techniques, and measures their quality using four quality metrics. At the same time, it summarizes the datasets using a histogram and well-designed skewness measures. The data summaries and the quality metrics are then used to train a deep learning model. The second phase uses this model to predict the best partitioning technique for a new dataset that needs to be partitioned. We run an extensive experimental evaluation on big spatial data, and we experimentally show the applicability of the proposed technique. We show that the proposed model outperforms the baseline method in terms of accuracy for choosing the best partitioning technique by only analyzing the summary of the datasets.
Keywords: Deep learning; Data synopsis; Skewed data; Spatial partitioning
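
The abstract says that each dataset is summarized by a histogram and skewness measures, but it does not define them. The following is a minimal sketch of what such a summary could look like, assuming 2D points normalized to the unit square. The function names, the 8 x 8 grid size, and the coefficient-of-variation measure are illustrative assumptions, not the authors' definitions; the partitioning techniques, quality metrics, and the deep learning model itself are not shown.

```python
import random


def grid_histogram(points, cells=8):
    """Count points per cell of a cells x cells uniform grid over the unit square."""
    hist = [[0] * cells for _ in range(cells)]
    for x, y in points:
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        col = min(int(x * cells), cells - 1)
        row = min(int(y * cells), cells - 1)
        hist[row][col] += 1
    return hist


def occupancy_skew(hist):
    """Hypothetical skewness measure: coefficient of variation of the cell counts.

    It is 0.0 for a perfectly even histogram and grows as points concentrate
    in fewer cells. This is a stand-in, not the measure used in the article.
    """
    counts = [c for row in hist for c in row]
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance ** 0.5 / mean


def summarize(points, cells=8):
    """Build a fixed-length feature vector: normalized histogram plus the skew value."""
    hist = grid_histogram(points, cells)
    total = max(len(points), 1)
    features = [c / total for row in hist for c in row]
    features.append(occupancy_skew(hist))
    return features


def synthetic_uniform(n, seed=0):
    """Synthetic dataset: n points drawn uniformly from the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]


def synthetic_clustered(n, seed=0, center=0.5, sigma=0.05):
    """Synthetic dataset: n points in one Gaussian cluster, clamped to the unit square."""
    rng = random.Random(seed)

    def clamp(v):
        return min(max(v, 0.0), 1.0)

    return [(clamp(rng.gauss(center, sigma)), clamp(rng.gauss(center, sigma)))
            for _ in range(n)]


if __name__ == "__main__":
    for name, data in [("uniform", synthetic_uniform(2000)),
                       ("clustered", synthetic_clustered(2000))]:
        summary = summarize(data)
        print(name, "features:", len(summary), "skew:", round(summary[-1], 3))
```

In a pipeline like the one the abstract describes, a vector such as the one returned by `summarize` would be paired with the measured quality of each partitioning technique to form one training example, so that only the summary is needed at prediction time.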
Files in this item:
File: Using_Deep_Learning_for_Big_Spatial_Data_Partitioning.pdf
Access: open access
Type: Post-print document
License: Creative Commons
Size: 7.71 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1031363
Citations
  • Scopus: 10
  • ISI: 9