
Voice activity detection by upper body motion analysis and unsupervised domain adaptation

Beyan, C.; Murino, V.
2019-01-01

Abstract

We present a novel vision-based voice activity detection (VAD) method that relies only on automatic upper body motion (UBM) analysis. Traditionally, VAD is performed using audio features only, but visual cues can be desirable instead of audio, especially when audio is unavailable due to technical, ethical, or legal issues. The psychology literature confirms that people move differently while speaking than while they are not speaking. This motivates our claim that an effective representation of UBM can be used to detect “Who is Speaking and When”. On the other hand, the way people move during speech varies greatly from culture to culture, and even from person to person within the same culture. This results in dissimilar UBM representations, such that the distributions of training and test data become disparate. To overcome this, we combine stacked sparse autoencoders and simple subspace alignment methods, while a classifier is jointly learned using the VAD labels of the training data only. This yields new domain-invariant feature representations for training and test data, leading to improved VAD results. Our approach is applicable to any person without requiring re-training. Tests on a publicly available real-life VAD dataset show better results compared to state-of-the-art video-only VAD methods. Moreover, an ablation study justifies the superiority of the proposed method and demonstrates the positive contribution of each component.
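The abstract mentions combining stacked sparse autoencoders with simple subspace alignment to obtain domain-invariant features. As a rough illustration of the subspace-alignment step only (not the authors' full pipeline; the function name, feature matrices, and subspace dimension d are hypothetical), a minimal sketch in Python/NumPy of classical subspace alignment could look like this:

    import numpy as np

    def subspace_alignment(X_src, X_tgt, d=20):
        """Minimal subspace alignment sketch (in the style of Fernando et al., 2013).

        Projects source and target features onto d-dimensional PCA subspaces
        and aligns the source basis to the target basis, so that a classifier
        trained on aligned source features transfers better to the target domain.
        """
        def pca_basis(X, d):
            # Center the data and take the top-d principal directions.
            Xc = X - X.mean(axis=0)
            _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
            return Vt[:d].T  # shape: (n_features, d)

        Ps = pca_basis(X_src, d)   # source subspace basis
        Pt = pca_basis(X_tgt, d)   # target subspace basis

        # Alignment matrix mapping the source basis onto the target basis.
        M = Ps.T @ Pt              # shape: (d, d)

        # Project centered data: source into the aligned subspace, target into its own.
        Zs = (X_src - X_src.mean(axis=0)) @ Ps @ M
        Zt = (X_tgt - X_tgt.mean(axis=0)) @ Pt
        return Zs, Zt

A VAD classifier (e.g., a linear SVM) would then be trained on Zs with the training-domain labels and applied to Zt. In the paper this kind of alignment is combined with stacked sparse autoencoder representations and joint classifier learning, which this sketch omits.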
Keywords: Voice activity detection, visual activity, dynamic images, optical flow, social interactions
Files in this record:
File: IC17_Voice Activity Detection by Upper Body Motion Analysis and Unsupervised.pdf (authorized users only)
Type: Publisher's version
License: Publisher's copyright
Size: 1.11 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1121842
Citations
  • Scopus: 12