A Framework for Optimizing CPU-iGPU Communication on Embedded Platforms

F. Lumpp; Nicola Bombieri
Member of the Collaboration Group
2021-01-01

Abstract

Many modern programmable embedded devices contain CPUs and a GPU that share the same system memory on a single die. Such a unified memory architecture allows explicit data copying between the CPU and the integrated GPU (iGPU) to be eliminated, with significant benefits in performance and energy savings. However, to enable such a "zero-copy" communication model, many devices either implement intricate cache coherence protocols or disable the last-level caches. This often leads to severe performance degradation for cache-dependent applications, for which CPU-iGPU data transfer based on standard copies remains the best solution. This paper presents a framework based on a performance model, a set of micro-benchmarks, and a novel zero-copy communication pattern to accurately estimate the potential speedup a CPU-iGPU application may achieve under different communication models (i.e., standard copy, unified memory, or pinned "zero-copy"). It shows how the framework can be combined with standard profiler information to efficiently drive application tuning for a given programmable embedded device.
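For context, the three communication models compared in the abstract are typically expressed as follows in CUDA. This is a minimal sketch, not code from the paper; the kernel, buffer size, and launch configuration are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define N (1 << 20)

// Trivial kernel used only to exercise each communication model.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    size_t bytes = N * sizeof(float);
    dim3 block(256), grid((N + block.x - 1) / block.x);

    // 1) Standard copy: explicit staging through device memory.
    float *h_buf = (float *)malloc(bytes);
    float *d_buf;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    scale<<<grid, block>>>(d_buf, 2.0f, N);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    free(h_buf);

    // 2) Unified memory: single pointer, coherence handled by the runtime.
    float *um_buf;
    cudaMallocManaged(&um_buf, bytes);
    scale<<<grid, block>>>(um_buf, 2.0f, N);
    cudaDeviceSynchronize();  // host must synchronize before touching um_buf
    cudaFree(um_buf);

    // 3) Pinned "zero-copy": page-locked, mapped host memory accessed
    //    directly by the iGPU over the shared system memory.
    float *zc_host, *zc_dev;
    cudaHostAlloc(&zc_host, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&zc_dev, zc_host, 0);
    scale<<<grid, block>>>(zc_dev, 2.0f, N);
    cudaDeviceSynchronize();
    cudaFreeHost(zc_host);

    printf("done\n");
    return 0;
}
```

On unified-memory embedded platforms the choice among these three patterns is exactly the trade-off the framework aims to estimate: avoided copies versus the cost of coherence handling or disabled caches.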
2021
Keywords: I/O cache coherence; CPU-GPU communication; Edge computing

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1037404
Citations
  • PMC: not available
  • Scopus: 5
  • Web of Science: 4