A Deep Learning-Based Emotion Recognition Pipeline for Public Speaking Anxiety Detection in Social Robotics
Michele Boldo; Nicola Bombieri
2025-01-01
Abstract
Social robots are increasingly employed as personalized coaches in educational settings, offering new opportunities for applications such as public speaking training. In this domain, emotional self-regulation plays a crucial role, especially for students presenting in a non-native language. This study proposes a novel pipeline for detecting public speaking anxiety (PSA) using multimodal emotion recognition. Unlike traditional datasets that typically rely on acted emotions, we consider spontaneous data from students interacting naturally with a social robot coach. Emotional labels are generated through knowledge distillation, enabling the creation of soft labels that reflect the emotional valence of each presentation. We introduce a lightweight multimodal model that integrates speech prosody and body posture to classify speakers by anxiety level, without relying on linguistic content. Evaluated on a collected dataset of student presentations, the system achieves 74.67% accuracy and an F1-score of 0.64. The model can run fully offline, with no network transmission, on an NVIDIA Jetson board, safeguarding data privacy and demonstrating its feasibility for real-world deployment.
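
To make the described approach concrete, the following is a minimal illustrative sketch, not the authors' implementation: the abstract does not disclose the architecture, so the feature dimensions, layer sizes, class count, and the ProsodyPoseFusion/distillation_loss names below are assumptions. It shows the two ideas the abstract combines: a lightweight late-fusion classifier over prosody and body-posture features, trained against knowledge-distilled soft labels via a standard KL-divergence distillation objective.

```python
# Illustrative sketch only (PyTorch). All dimensions and names are assumptions,
# not the published model: the abstract specifies only prosody + posture fusion
# and soft labels obtained through knowledge distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProsodyPoseFusion(nn.Module):
    """Hypothetical lightweight fusion of prosody and body-posture features."""

    def __init__(self, prosody_dim=64, pose_dim=34, hidden_dim=64, num_classes=2):
        super().__init__()
        self.prosody_branch = nn.Sequential(nn.Linear(prosody_dim, hidden_dim), nn.ReLU())
        self.pose_branch = nn.Sequential(nn.Linear(pose_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, prosody, pose):
        # Concatenate the two modality embeddings and classify anxiety level.
        fused = torch.cat([self.prosody_branch(prosody), self.pose_branch(pose)], dim=-1)
        return self.classifier(fused)  # logits


def distillation_loss(student_logits, teacher_probs, temperature=2.0):
    """KL divergence between the student's softened prediction and the
    teacher-provided soft labels (standard knowledge-distillation objective)."""
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p, teacher_probs, reduction="batchmean") * temperature ** 2


# Dummy forward/backward pass with random features (placeholder dimensions).
model = ProsodyPoseFusion()
prosody = torch.randn(8, 64)                     # e.g. pooled prosodic descriptors per clip
pose = torch.randn(8, 34)                        # e.g. flattened 2D keypoints per window
teacher = F.softmax(torch.randn(8, 2), dim=-1)   # soft labels from a teacher model
loss = distillation_loss(model(prosody, pose), teacher)
loss.backward()
```

A compact two-branch model of this kind is the sort of architecture that can plausibly run fully offline on an NVIDIA Jetson board, which is the deployment scenario the abstract reports.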