Explainable Emotion Recognition Using Xception-Based Feature Extraction and Supervised Machine Learning on the RAVDESS Dataset
Buccoliero, Andrea;
2025-01-01
Abstract
Facial emotion recognition is a valuable tool in healthcare, providing insights into emotional well-being, developmental progress, and health-related behaviors. This study presents a novel framework integrating deep learning with explainable artificial intelligence (XAI) to enhance emotion recognition from video data. Using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the framework begins with preprocessing, where 3D face meshes with 478 landmarks are generated using MediaPipe and regions of interest (ROI) are extracted. Data augmentation techniques, including rotation, scaling, and translation, improve dataset variability. Feature extraction is performed using a fine-tuned Xception deep convolutional neural network, followed by classification using supervised machine learning algorithms such as SVM, KNN, ensemble methods, and ANN. Among these, the Fine Gaussian SVM (FGSVM) achieved the highest performance, with 93.87% accuracy on both the validation and test sets. The validation precision, recall, and F1-score were 94.06%, 93.79%, and 93.93%, respectively, while the test set recorded 94.01%, 93.74%, and 93.88%. To ensure interpretability, XAI techniques such as Grad-CAM, LIME, occlusion sensitivity, and SHAP highlight the facial landmarks and temporal frames that most influence predictions. This study underscores the potential of combining deep learning with XAI to enhance reliability in healthcare applications, improving clinical decision-making, mental health monitoring, and human-computer interaction. A Python-based implementation of the proposed framework is available at DOI: 10.5281/zenodo.14809940.
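To make the pipeline in the abstract concrete, the following is a minimal Python sketch of its core steps: face-mesh ROI extraction, Xception feature extraction, and Gaussian-SVM classification. It is an illustration under stated assumptions, not the authors' code: it uses MediaPipe's FaceMesh solution API and a frozen ImageNet-pretrained Keras Xception (the paper fine-tunes the network), a plain bounding-box crop stands in for the paper's ROI definition, and the helper names `extract_face_roi` and `xception_features` are hypothetical. The released implementation is available at the Zenodo DOI above.

```python
# Minimal sketch, not the authors' implementation. Assumptions: MediaPipe's
# FaceMesh solution for the 478-landmark mesh, a frozen ImageNet Xception as
# the feature extractor (the paper fine-tunes it), a bounding-box crop as a
# stand-in for the paper's ROI extraction, and an RBF-kernel SVC approximating
# the "Fine Gaussian SVM".
import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras.applications import Xception
from tensorflow.keras.applications.xception import preprocess_input
from sklearn.svm import SVC

def extract_face_roi(frame_bgr):
    """Run the 478-landmark face mesh and crop a bounding box around it."""
    with mp.solutions.face_mesh.FaceMesh(
        static_image_mode=True, max_num_faces=1, refine_landmarks=True
    ) as mesh:  # refine_landmarks=True yields 478 landmarks (468 + 10 iris)
        result = mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None  # no face detected in this frame
    h, w = frame_bgr.shape[:2]
    pts = np.array([(lm.x * w, lm.y * h)
                    for lm in result.multi_face_landmarks[0].landmark])
    x0, y0 = np.maximum(pts.min(axis=0).astype(int), 0)
    x1, y1 = pts.max(axis=0).astype(int)
    return frame_bgr[y0:y1, x0:x1]

# Xception backbone with global average pooling: one 2048-d vector per frame.
backbone = Xception(weights="imagenet", include_top=False, pooling="avg")

def xception_features(roi_bgr):
    """Resize the ROI to Xception's 299x299 input and extract features."""
    x = cv2.resize(roi_bgr, (299, 299)).astype("float32")
    x = preprocess_input(np.expand_dims(x, axis=0))  # scales pixels to [-1, 1]
    return backbone.predict(x, verbose=0)[0]

# An RBF ("Gaussian") SVM; a small kernel scale (large gamma) is what makes a
# Gaussian SVM "fine" in MATLAB Classification Learner terminology.
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
# clf.fit(train_features, train_labels)  # stacked per-frame feature vectors
# pred = clf.predict(xception_features(extract_face_roi(frame))[None, :])
```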
| File | Description | Type | License | Size | Format | Access |
|---|---|---|---|---|---|---|
| Explainable Emotion Recognition Using Xception-Based Feature Extraction and Supervised Machine Learning on the RAVDESS Dataset.pdf | Full-text paper | Publisher's version | Publisher copyright | 1.69 MB | Adobe PDF | Authorized users only (request a copy) |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.