Detecting skewness of big spatial data in SpatialHadoop

Alberto Belussi; Sara Migliorini
2018-01-01

Abstract

In recent years, several extensions of the Hadoop system have been proposed for dealing with spatial data, and SpatialHadoop belongs to this group. In the MapReduce paradigm, a task can be parallelized by partitioning the data into chunks, performing the same operation on each chunk, and combining the partial results at the end. The partitioning technique applied can therefore greatly affect the performance of a parallel execution, since it is the key to obtaining balanced map tasks. However, when datasets with a skewed distribution are considered, a regular grid might not be the right choice, and other techniques have to be applied, which in turn are more expensive to build. This paper illustrates an approach for detecting the degree of skewness of a spatial dataset, based on the box counting function. Moreover, given the degree of skewness and some experimental observations, a heuristic is sketched for deciding which partitioning technique to apply in order to improve the performance of subsequent operations as much as possible.
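As a rough illustration of the box counting idea mentioned in the abstract, the sketch below overlays grids of decreasing cell size on a 2D point set, sums the squared fraction of points falling in each cell, and fits the slope of that sum against the cell size on a log-log scale. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `box_count` and `skew_exponent`, the choice of exponent q = 2, the cell sizes, and the restriction to points in the unit square are all hypothetical. Under these assumptions, a slope close to 2 suggests a roughly uniform distribution, while a clearly lower slope suggests skew.

```python
import math
from collections import Counter


def box_count(points, r, q=2):
    """Sum, over occupied grid cells of side r, of (fraction of points in cell) ** q."""
    n = len(points)
    # Assign each point to the grid cell that contains it.
    cells = Counter((math.floor(x / r), math.floor(y / r)) for x, y in points)
    return sum((c / n) ** q for c in cells.values())


def skew_exponent(points, sizes, q=2):
    """Least-squares slope of log(box_count) versus log(r), divided by (q - 1).

    Assumption of this sketch: q != 1 and at least two distinct cell sizes.
    """
    xs = [math.log(r) for r in sizes]
    ys = [math.log(box_count(points, r, q)) for r in sizes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return (num / den) / (q - 1)
```

A caller would pass a list of `(x, y)` tuples and a handful of cell sizes such as `[1/2, 1/4, 1/8, 1/16]`. How the resulting exponent maps to a choice of partitioning technique is the subject of the paper's heuristic and is not reproduced here.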
Year: 2018
ISBN: 978-1-4503-5889-7
Keywords: SpatialHadoop; Skewed data; Partitioning; MapReduce; BigData

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/987310