During the last decade, the field of crowd analysis has seen remarkable progress in crowded scene understanding, including crowd behavior analysis, crowd tracking, and crowd segmentation. Much of this progress was sparked by the creation of crowd datasets and by new, robust features and models for profiling the intrinsic properties of crowds. Most of these studies are scene-specific: the crowd model is learned from a specific scene and therefore generalizes poorly to other scenes. Attributes are particularly effective at characterizing generic properties across scenes. Indeed, attributes can express more information about a crowd video because they describe it by answering “Who is in the crowd?”, “Where is the crowd?”, and “Why is the crowd here?”, rather than merely assigning it a categorical scene or event label. For instance, an attribute-based representation might describe a crowd video as the “conductor” and “choir” performing on the “stage” with the “audience” “applauding”, in contrast to a categorical label like “chorus”.
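The attribute-based description above can be pictured as a multi-hot vector over an attribute vocabulary. The following is only an illustrative sketch: the vocabulary contains just the five attributes named in the example (the dataset defines 94), their assignment to the Where/Who/Why groups is our assumption, and the names `VOCABULARY`, `flatten_vocabulary`, and `encode` are hypothetical.

```python
# Hypothetical mini-vocabulary built from the attributes named in the text.
# The grouping into where/who/why is an assumption made for illustration.
VOCABULARY = {
    "where": ["stage"],
    "who": ["conductor", "choir", "audience"],
    "why": ["applauding"],
}


def flatten_vocabulary(vocabulary):
    """Return a fixed, ordered list of (group, attribute) pairs."""
    return [
        (group, name)
        for group in ("where", "who", "why")
        for name in vocabulary[group]
    ]


def encode(attributes, vocabulary):
    """Encode a collection of attribute names as a multi-hot list of 0/1."""
    present = set(attributes)
    ordered = flatten_vocabulary(vocabulary)
    unknown = present - {name for _, name in ordered}
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [1 if name in present else 0 for _, name in ordered]


# A video showing a full choir performance versus one showing only an audience.
chorus_video = encode(
    ["conductor", "choir", "stage", "audience", "applauding"], VOCABULARY
)
audience_only_video = encode(["audience"], VOCABULARY)
```

Unlike a single categorical label such as “chorus”, the two vectors share the “audience” entry, so partial similarity between videos from different scenes stays visible.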
The contributions of this work:
- The largest crowd dataset with crowd attribute annotations - We establish a large-scale crowd dataset with 10,000 videos from 8,257 scenes. We design and annotate 94 crowd-related attributes to describe each video in the dataset. This is the first time such a large set of attributes for crowd understanding has been defined.
- Deeply learned features for crowd scene understanding - We develop a multi-task learning deep model to jointly learn appearance and motion features and effectively combine them. Instead of directly inputting multiple frames to a deep model to learn motion features, as most existing work on video analysis does, we specially design crowd motion channels as the input of the deep model. The motion channels are inspired by generic properties of crowd systems, which have been well studied in biology and physics. With multi-task learning, the correlations among attributes are well captured when learning deep features.
- Extensive experimental evaluation and a user study to explore the WWW dataset - These provide valuable insights into how static appearance cues and motion cues behave differently and complementarily on the three types of attributes: “Where”, “Who” and “Why”. They also show that features specifically learned for human crowds are more effective than state-of-the-art handcrafted features.
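To make the multi-task setting in the second contribution more concrete, here is a minimal sketch of a loss that scores all attributes of a video at once. This page does not state the loss used in the paper; the sketch assumes one independent sigmoid output per attribute and a mean binary cross-entropy, and the function name `multi_attribute_loss` is hypothetical.

```python
import math


def multi_attribute_loss(scores, labels):
    """Mean binary cross-entropy over attributes.

    scores: raw (pre-sigmoid) model outputs, one per attribute.
    labels: ground-truth 0/1 values, one per attribute.
    Uses the numerically stable form max(s, 0) - s*y + log(1 + exp(-|s|)).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one attribute is required")
    total = 0.0
    for s, y in zip(scores, labels):
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)


# Three attributes: the first and third are present, the second is absent.
labels = [1, 0, 1]
good_loss = multi_attribute_loss([4.0, -4.0, 3.0], labels)   # scores agree with labels
bad_loss = multi_attribute_loss([-4.0, 4.0, -3.0], labels)   # scores contradict labels
```

Because every attribute contributes to one shared objective, a network trained this way updates its shared features from all attributes together, which is one plausible way for correlations among attributes to be captured.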
WWW Crowd Dataset
Most existing public crowd datasets contain only one or two specific scenes, and even the CUHK Crowd dataset provides only 474 videos from 215 crowded scenes. In contrast, our proposed WWW dataset provides 10,000 videos with over 8 million frames from 8,257 diverse scenes, making it a far more comprehensive dataset for crowd understanding. The wide range of video sources further enriches its diversity and completeness.
- The dataset can be downloaded from either Baidu disk or Dropbox.
- The Readme describes the WWW Crowd Dataset files used in our paper, which are archived here, including the crowd attribute list, movie list, training/test/validation sets, and scene labels.
- The groundtruth annotations for all the videos in the WWW dataset can be downloaded here.
- We extract the first representative frame from each video in the dataset to build a single frame set.
- These data can only be used for academic research purposes.
A quick glance at the WWW Crowd Dataset (left) with its attributes (right). Red represents the location (Where), green represents the subject (Who), and blue represents the event/action (Why). The area of each word is proportional to the frequency of that attribute in the WWW dataset.
A preview of video samples for all the attributes in the WWW dataset.
Deeply Learned Crowd Features
The traditional input to a deep model is a single frame (RGB channels) or multiple frames. Some well-known motion features, such as optical flow, cannot characterize motion patterns in crowded scenes well, especially across different scenes. In this paper, we propose three scene-independent motion channels to complement the appearance channels, as shown in the right figure.
The first row gives an example that briefly illustrates how the three motion channels are constructed. For each channel, two examples are shown in the second and third rows. Individuals moving randomly in a crowd indicate low collectiveness, while coherent crowd motion reveals high collectiveness. Individuals have low stability if their topological structure changes a lot and high stability if it changes little. Conflict occurs when individuals move in different directions.
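The intuition behind the three channels can be illustrated with simple proxies computed on tracked points. These are not the descriptors used in the paper, whose exact definitions are not given on this page. The sketch assumes 2-D positions and velocities per individual; collectiveness is approximated by mean pairwise cosine similarity of velocities, conflict by the fraction of pairs with opposing velocities, and stability by the fraction of k nearest neighbours preserved between two frames. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two 2-D vectors (0.0 if either has zero length)."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)


def index_pairs(n):
    """All unordered index pairs (i, j) with i < j; requires n >= 2."""
    if n < 2:
        raise ValueError("need at least two individuals")
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def collectiveness_proxy(velocities):
    """Mean pairwise cosine similarity: high when motion is coherent."""
    ps = index_pairs(len(velocities))
    return sum(cosine(velocities[i], velocities[j]) for i, j in ps) / len(ps)


def conflict_proxy(velocities):
    """Fraction of pairs whose velocities point in opposing directions."""
    ps = index_pairs(len(velocities))
    opposing = sum(1 for i, j in ps if cosine(velocities[i], velocities[j]) < 0)
    return opposing / len(ps)


def nearest_neighbours(points, i, k):
    """Set of indices of the k points closest to point i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def stability_proxy(points_before, points_after, k):
    """Mean fraction of each point's k nearest neighbours kept across frames."""
    n = len(points_before)
    kept = sum(
        len(nearest_neighbours(points_before, i, k)
            & nearest_neighbours(points_after, i, k))
        for i in range(n)
    )
    return kept / (n * k)


# Coherent flow versus two groups walking against each other.
coherent = [(1.0, 0.0), (1.0, 0.1), (0.9, 0.0)]
opposing = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]

# Four individuals on a line: a rigid shift keeps the neighbourhood structure,
# while swapping the second and fourth individuals changes it.
before = [(0, 0), (1, 0), (2, 0), (10, 0)]
shifted = [(5, 5), (6, 5), (7, 5), (15, 5)]
swapped = [(0, 0), (10, 0), (2, 0), (1, 0)]

stable_score = stability_proxy(before, shifted, k=2)
unstable_score = stability_proxy(before, swapped, k=2)
```

With these inputs the coherent flow scores higher on the collectiveness proxy than the opposing groups, the opposing groups score higher on the conflict proxy, and the rigid shift keeps every neighbourhood intact while the swap does not.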
Reference and Acknowledgments
If you use our dataset, please cite our paper.
Jing Shao, Kai Kang, Chen Change Loy, and Xiaogang Wang. "Deeply learned attributes for crowded scene understanding". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015, oral).
This work is partially supported by the General Research Fund sponsored by the Research Grants Council of Hong Kong (Project Nos. CUHK 419412, CUHK 417011, CUHK 14206114, and CUHK 14207814), the Hong Kong Innovation and Technology Support Programme (Project reference ITS/221/13FP), the Shenzhen Basic Research Program (JCYJ20130402113127496), and a hardware donation from NVIDIA Corporation. We thank Lu Sheng and Tong Xiao for valuable discussions and support.
If you have any questions, please feel free to contact me (email@example.com).
update: May 18, 2015