Scalable Safe Policy Improvement for Single and Multi-Agent Systems

Federico Bianchi
2025-01-01

Abstract

Reinforcement Learning (RL) has achieved remarkable success in domains such as robotics and strategic games, but deploying RL in real-world applications remains challenging due to computational complexity, safety requirements, and sample inefficiency. Ensuring safety is crucial, especially in critical applications where unreliable policies can pose significant risks. Offline RL mitigates some of these issues by training policies on datasets pre-collected by a baseline policy; however, it suffers from distributional shift and extrapolation errors, which can make the resulting policies unreliable at deployment. Safe Policy Improvement (SPI) addresses some of these challenges by producing improved policies with probabilistic guarantees of performing at least as well as the given baseline policy. Despite promising results, existing SPI algorithms struggle to scale, particularly in large single-agent and multi-agent problems, because of the combinatorial explosion of state and action spaces.

This thesis develops scalable SPI methods for single-agent and multi-agent systems. For single-agent systems, we introduce MCTS-SPIBB, which integrates Monte Carlo Tree Search (MCTS) into the SPI with Baseline Bootstrapping (SPIBB) framework and improves scalability by focusing computation on reachable states. We also present Scalable Dynamic Programming SPIBB (SDP-SPIBB), which scales SPIBB by applying dynamic programming only to the relevant state-action subspaces. Both methods are proven to converge asymptotically to safe and optimal policies, and empirical evaluations demonstrate their effectiveness in large-scale environments where state-of-the-art SPI methods are intractable.

For multi-agent systems, we propose Factored Value MCTS-SPIBB (FV-MCTS-SPIBB), the first SPI algorithm capable of scaling to domains with exponentially large joint state-action spaces while maintaining safety guarantees. It exploits the factored structure of the problem and introduces two novel action selection strategies, Constrained Max-Plus and Constrained Variable Elimination, which enable tractable cooperative action selection without evaluating all possible joint actions. Theoretical analysis establishes convergence to safe and optimal policies, and empirical results demonstrate scalability and reliability on large multi-agent domains.

Finally, the thesis proposes an MCTS-based planning method for continuous action spaces. The approach, called MCTS-APW2, extends the Action Progressive Widening algorithm and improves sampling by directing exploration toward higher-value actions. Empirical evaluations show that MCTS-APW2 outperforms state-of-the-art algorithms, selecting near-optimal actions with small error margins. In summary, this thesis contributes scalable SPI solutions for both single-agent and multi-agent settings and advances MCTS planning for continuous action spaces, addressing critical challenges in deploying RL to real-world applications and enabling agents to operate safely and reliably in complex and uncertain environments.
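For readers unfamiliar with the baseline-bootstrapping idea behind SPIBB, MCTS-SPIBB, and SDP-SPIBB, the following Python sketch illustrates a greedy SPIBB-style projection step on a tabular problem. It is a minimal illustration of the standard SPIBB constraint, not code from the thesis; the function name, the count threshold `n_wedge`, and the tabular setting are assumptions made only for this example.

```python
import numpy as np

def spibb_greedy_projection(q, pi_b, counts, n_wedge):
    """Greedy SPIBB-style projection step on a tabular MDP (illustrative only).

    For state-action pairs observed fewer than `n_wedge` times in the offline
    dataset, the improved policy keeps the baseline probability pi_b(a|s)
    (baseline bootstrapping). The probability mass of the remaining,
    well-estimated pairs is reassigned greedily w.r.t. the Q-estimates `q`.
    """
    pi = np.array(pi_b, dtype=float)          # start from a copy of the baseline policy
    n_states = q.shape[0]
    for s in range(n_states):
        supported = counts[s] >= n_wedge      # pairs with enough data to trust q[s, a]
        if not supported.any():
            continue                          # no trustworthy action: keep the baseline row
        pi[s, supported] = 0.0                # free the mass held by supported actions
        free_mass = 1.0 - pi[s].sum()         # mass on bootstrapped pairs is left untouched
        best = np.flatnonzero(supported)[np.argmax(q[s, supported])]
        pi[s, best] = free_mass               # put all freed mass on the best supported action
    return pi

# Tiny usage example with random data (4 states, 3 actions, uniform baseline).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 3))
pi_b = np.full((4, 3), 1.0 / 3.0)
counts = rng.integers(0, 20, size=(4, 3))
pi_new = spibb_greedy_projection(q, pi_b, counts, n_wedge=10)
assert np.allclose(pi_new.sum(axis=1), 1.0)   # each row is still a valid distribution
```

The key point is the constraint itself: wherever the dataset is too thin to trust the estimated values, the improved policy falls back to the baseline, which is what underlies the probabilistic improvement guarantee that the scalable methods in the thesis preserve.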
Keywords: Reinforcement Learning, Offline Reinforcement Learning, Safe Policy Improvement, Monte Carlo Tree Search, Safety, Scalability, Single-Agent Systems, Multi-Agent Systems
Files in this record:

Federico_Bianchi_Doctoral_Thesis.pdf

Access: open access
Type: Doctoral thesis
License: Public domain
Size: 5.11 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1158451