Slicing convolutional neural network for crowd video understanding
Refereed conference paper presented and published in conference proceedings

Abstract: Learning and capturing both appearance and dynamic representations is pivotal for crowd video understanding. Convolutional Neural Networks (CNNs) have shown remarkable potential in learning appearance representations from images. However, how to learn dynamic representations, and how to combine them effectively with appearance features for video analysis, remain open problems. In this study, we propose a novel spatio-temporal CNN, named Slicing CNN (S-CNN), based on the decomposition of 3D feature maps into 2D spatial- and 2D temporal-slice representations. The decomposition brings unique advantages: (1) the model can capture the dynamics of different semantic units such as groups and objects; (2) it learns separate appearance and dynamic representations while keeping proper interactions between them; and (3) it exploits the selectiveness of spatial filters to discard irrelevant background clutter for crowd understanding. We demonstrate the effectiveness of the proposed S-CNN model on the WWW crowd video dataset for attribute recognition and observe significant improvement over the state-of-the-art method (from 51.84% [21] to 62.55%).
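The slicing idea in the abstract can be illustrated with plain array indexing: stacking a single feature channel over time gives a 3D volume, and fixing one axis yields the three slice types. This is only an illustrative sketch with made-up names and shapes, not the paper's implementation.

```python
import numpy as np

# Hypothetical feature volume: one feature channel stacked over T frames,
# shape (T, H, W). Values and dimensions are illustrative only.
T, H, W = 8, 6, 6
volume = np.arange(T * H * W, dtype=np.float32).reshape(T, H, W)

# xy-slice (fix time t): an ordinary 2D spatial map, carrying appearance.
xy_slice = volume[3, :, :]   # shape (H, W)

# xt-slice (fix row y): how one horizontal line of the map evolves over time.
xt_slice = volume[:, 2, :]   # shape (T, W)

# yt-slice (fix column x): how one vertical line of the map evolves over time.
yt_slice = volume[:, :, 4]   # shape (T, H)
```

Each 2D slice can then be fed to standard 2D convolutional filters, which is what lets the model learn appearance (xy) and dynamics (xt, yt) with separate but interacting branches.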
All Author(s) List: Shao J., Loy C.C., Kang K., Wang X.
Name of Conference: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
Start Date of Conference: 26/06/2016
End Date of Conference: 01/07/2016
Place of Conference: Las Vegas
Country/Region of Conference: United States of America
Detailed description: organized by IEEE
Volume Number: 2016-January
Pages: 5620 - 5628
Languages: English-United Kingdom

Last updated on 2020-06-09 at 01:17