Low-level multimodal integration on Riemannian manifolds for automatic pedestrian detection
CRISTANI, Marco; MARTELLI, Samuele; MURINO, Vittorio
2012-01-01
Abstract
In Computer Vision, automated pedestrian detection is one of the most active research topics, with important applications in surveillance and security. In this context, integrating information from different imaging modalities, such as thermal infrared and the visible spectrum, can significantly improve the detection rate with respect to monomodal strategies. A common scheme extracts two sets of features, one from the thermal and one from the visible image of the same scene, and stacks them into a single feature set, ignoring potentially meaningful cross-modal dependencies. Here we propose a fusion scheme that acts at the feature level: it takes standard pixel characteristics (such as first/second-order spatial derivatives or Local Binary Patterns) and designs a composite descriptor that simultaneously encodes the information coming from each modality and the cross-modal relationships, in the form of covariances. The descriptor, which lies on a Riemannian manifold, is projected onto a Euclidean tangent space and then fed into a Support Vector Machine classifier. Experiments performed on the OTCBVS dataset [1], and validated statistically, demonstrate that our method significantly outperforms single-modality policies as well as alternative fusion schemes at the pixel, feature, and decision levels.
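The pipeline described in the abstract (per-pixel features from both modalities, a joint covariance descriptor, a log-map projection to a Euclidean tangent space, and an SVM) can be illustrated with a minimal sketch. This is not the authors' code: the feature set (intensity plus first-order Sobel derivatives), the regularization term, and the choice of the identity matrix as the tangent point (rather than, e.g., the mean covariance of the training set) are simplifying assumptions made only for illustration.

```python
# Hedged sketch of covariance-based thermal/visible fusion for a detection window.
import numpy as np
from scipy.linalg import logm
from scipy.ndimage import sobel
from sklearn.svm import LinearSVC

def pixel_features(img):
    """Stack per-pixel intensity and first-order spatial derivatives (assumed feature set)."""
    img = img.astype(float)
    return np.stack([img, sobel(img, axis=0), sobel(img, axis=1)], axis=0)

def covariance_descriptor(thermal, visible):
    """Covariance of the stacked thermal + visible features over one detection window.
    The off-diagonal blocks capture the cross-modal relationships."""
    feats = np.concatenate([pixel_features(thermal), pixel_features(visible)], axis=0)
    X = feats.reshape(feats.shape[0], -1)           # d x N matrix of per-pixel features
    return np.cov(X) + 1e-6 * np.eye(X.shape[0])    # regularized d x d SPD matrix

def tangent_vector(C):
    """Map the SPD descriptor to a Euclidean tangent space via the matrix logarithm
    (tangent point fixed at the identity here, as a simplification) and vectorize
    the upper triangle."""
    L = logm(C).real
    return L[np.triu_indices(L.shape[0])]

# Hypothetical usage: 'windows' is a list of aligned (thermal, visible) crops,
# 'labels' marks pedestrian vs. background.
# X = np.array([tangent_vector(covariance_descriptor(t, v)) for t, v in windows])
# clf = LinearSVC().fit(X, labels)
```

Under these assumptions, a linear SVM operates on the vectorized tangent-space descriptors; any kernel SVM could be substituted without changing the fusion step.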