A Deep Learning-Based Emotion Recognition Pipeline for Public Speaking Anxiety Detection in Social Robotics

Michele Boldo; Nicola Bombieri
2025-01-01

Abstract

Social robots are increasingly employed as personalized coaches in educational settings, offering new opportunities for applications such as public speaking training. In this domain, emotional self-regulation plays a crucial role, especially for students presenting in a non-native language. This study proposes a novel pipeline for detecting public speaking anxiety (PSA) using multimodal emotion recognition. Unlike traditional datasets that typically rely on acted emotions, we consider spontaneous data from students interacting naturally with a social robot coach. Emotional labels are generated through knowledge distillation, enabling the creation of soft labels that reflect the emotional valence of each presentation. We introduce a lightweight multimodal model that integrates speech prosody and body posture to classify speakers by anxiety level, without relying on linguistic content. Evaluated on a collected dataset of student presentations, the system achieves 74.67% accuracy and an F1-score of 0.64. The model runs fully offline on an NVIDIA Jetson board, safeguarding data privacy and demonstrating its feasibility for real-world deployment.
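To make the pipeline described above more concrete, the sketch below shows a minimal two-branch fusion classifier trained on teacher-generated soft labels. It is an illustrative assumption only: feature dimensions, layer widths, the temperature value, and all names (ProsodyPostureFusion, distillation_loss) are hypothetical and are not taken from the paper.

```python
# Minimal sketch (assumptions only): a lightweight two-branch model that fuses
# speech-prosody and body-posture features, trained with soft labels produced
# by a teacher model (knowledge distillation), as outlined in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyPostureFusion(nn.Module):
    def __init__(self, prosody_dim=64, posture_dim=34, hidden=32, n_classes=2):
        super().__init__()
        # One small encoder per modality keeps the model light enough for
        # embedded boards such as an NVIDIA Jetson.
        self.prosody_enc = nn.Sequential(nn.Linear(prosody_dim, hidden), nn.ReLU())
        self.posture_enc = nn.Sequential(nn.Linear(posture_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, prosody, posture):
        # Late fusion: concatenate the two modality embeddings, then classify.
        z = torch.cat([self.prosody_enc(prosody), self.posture_enc(posture)], dim=-1)
        return self.head(z)

def distillation_loss(student_logits, teacher_probs, temperature=2.0):
    # Soft-label loss: KL divergence between the student's tempered
    # distribution and the teacher's soft labels.
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p, teacher_probs, reduction="batchmean") * temperature ** 2

# Usage example with random tensors standing in for one batch of presentations.
model = ProsodyPostureFusion()
prosody = torch.randn(8, 64)        # e.g. pooled prosodic features per clip
posture = torch.randn(8, 34)        # e.g. flattened 2D body keypoints
teacher_probs = torch.softmax(torch.randn(8, 2), dim=-1)  # soft labels from a teacher
loss = distillation_loss(model(prosody, posture), teacher_probs)
loss.backward()
```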
2025
Social Intelligence for Robots
Affective Computing
Machine Learning and Adaptation
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1170309