Scalable Safe Policy Improvement for Single and Multi-Agent Systems

Federico Bianchi; Alberto Castellini; Alessandro Farinelli
2025-01-01

Abstract

Safe Policy Improvement (SPI) is crucial in domains where reliable decision-making must be achieved with limited environmental interaction, given the high costs and risks involved. Although existing SPI algorithms ensure improved safety over baseline policies, they struggle to scale to large and complex problems. In this work, we discuss new approaches to enhance the scalability and safety of SPI for both single-agent and multi-agent systems. For single-agent scenarios, we introduce MCTS-SPIBB, which combines Monte Carlo Tree Search with Safe Policy Improvement with Baseline Bootstrapping, and SDP-SPIBB, a scalable dynamic programming approach that extends SPI to large domains while preserving safety guarantees. For multi-agent settings, we present Factored Value-MCTS-SPIBB, the first SPI method to address large-scale multi-agent problems effectively. Through theoretical and empirical evaluation, we show that our algorithms scale efficiently and maintain the safety properties of SPI, thus making SPI applicable to complex and large-scale scenarios.
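As background, the SPIBB safety mechanism the abstract builds on can be sketched in a few lines: policy improvement is restricted to state-action pairs observed often enough in the batch dataset, and the policy falls back to ("bootstraps from") the baseline everywhere else. This is a minimal tabular sketch, not the paper's algorithm; the threshold `n_wedge` follows the standard SPIBB formulation, while the function name and array shapes are our own assumptions.

```python
import numpy as np

def spibb_improvement(pi_b, Q, counts, n_wedge):
    """SPIBB-style safe improvement of a baseline policy pi_b.

    pi_b:    (S, A) baseline policy probabilities
    Q:       (S, A) action-value estimates from the batch data
    counts:  (S, A) number of times each (s, a) appears in the dataset
    n_wedge: minimum visit count required before deviating from pi_b
    """
    n_states, _ = Q.shape
    pi = np.zeros_like(pi_b)
    for s in range(n_states):
        rare = counts[s] < n_wedge          # under-observed actions
        pi[s, rare] = pi_b[s, rare]         # bootstrap: keep baseline there
        free_mass = pi_b[s, ~rare].sum()    # mass we are allowed to move
        if free_mass > 0:
            q = np.where(rare, -np.inf, Q[s])
            pi[s, int(np.argmax(q))] += free_mass  # greedy on safe actions
    return pi
```

Under this scheme the new policy can only differ from the baseline where the data support the change, which is what yields SPIBB's safe-improvement guarantee with high probability.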
Keywords: Safe policy improvement; Reinforcement learning; Single-agent systems; Multi-agent systems
Files in this product:

File: 2025_AIRO2025_ScalableSPI.pdf (open access)
Description: Article
Type: Publisher's version
License: Public domain
Size: 300.7 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1186988
Citations
  • PMC: not available
  • Scopus: 1
  • Web of Science: not available