Cross-scene Crowd Counting via Deep Convolutional Neural Networks

Cong Zhang1,2, Hongsheng Li2, Xiaogang Wang2, and Xiaokang Yang1

1Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University

2Department of Information Engineering, The Chinese University of Hong Kong

[PDF] [Extended Abstract] [Poster] [Dataset]



Counting pedestrians in crowd videos has drawn considerable attention because of strong demand in video surveillance, and it is especially important for metropolitan security. In this paper, we propose a framework for cross-scene crowd counting. No extra annotations are needed for a new target scene. Our goal is to learn a mapping from images to crowd counts, and then to use the mapping in unseen target scenes for cross-scene crowd counting. To achieve this goal, we need to overcome the following challenges.

  1) Develop effective features to describe crowds. Previous works used general hand-crafted features, which have low representation capability for crowds. New descriptors specially designed or learned for crowd scenes are needed.

  2) Different scenes have different perspective distortions, crowd distributions and lighting conditions. Without additional training data, a model trained on one specific scene is difficult to apply to other scenes.

  3) For most recent works, foreground segmentation is indispensable for crowd counting. But crowd segmentation is a challenging problem and cannot be obtained accurately in most crowded scenes. The scene may also contain stationary crowds without movement.

  4) Existing crowd counting datasets are not sufficient to support and evaluate cross-scene counting research.

Considering these challenges, we propose a Convolutional Neural Network (CNN) based framework for cross-scene crowd counting. After a CNN is trained with a fixed dataset, a data-driven method is introduced to fine-tune (adapt) the learned CNN to an unseen target scene, where training samples similar to the target scene are retrieved from the training scenes for fine-tuning. Our cross-scene crowd density estimation and counting framework has the following advantages:

WorldExpo’10 Crowd Counting Dataset

We introduce a new large-scale cross-scene crowd counting dataset. To the best of our knowledge, this is the largest dataset focusing on cross-scene counting. It includes 1,132 annotated video sequences captured by 108 surveillance cameras, all from Shanghai 2010 WorldExpo. Since most of the cameras have disjoint bird's-eye views, they cover a large variety of scenes.
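The figures in this section can be cross-checked with a little arithmetic. The sketch below is only a sanity check, assuming the split sizes quoted in the two list items that follow (1,127 training and validation sequences from 103 scenes; 5 test sequences from 5 scenes, each with 120 labeled frames spaced 30 seconds apart) and assuming one scene per camera. The dictionary keys and the function name `summarize_split` are illustrative and are not part of the dataset or its tools.

```python
# Split figures as quoted on this page; key names are illustrative only.
TRAIN_VAL = {"sequences": 1127, "scenes": 103}
TEST = {
    "sequences": 5,
    "scenes": 5,
    "labeled_frames_per_scene": 120,
    "label_interval_s": 30,
}


def summarize_split(train_val, test):
    """Derive the totals implied by the quoted split sizes."""
    return {
        "total_sequences": train_val["sequences"] + test["sequences"],
        # Assumes each scene corresponds to exactly one camera.
        "total_scenes": train_val["scenes"] + test["scenes"],
        "test_labeled_frames": test["scenes"] * test["labeled_frames_per_scene"],
        # Time covered per test scene if labels are evenly spaced.
        "test_span_s": test["labeled_frames_per_scene"] * test["label_interval_s"],
    }


summary = summarize_split(TRAIN_VAL, TEST)
print(summary)
```

Under these assumptions the totals come to 1,132 sequences and 108 scenes, matching the numbers stated above, and 120 labels at 30-second spacing span 3,600 seconds, which agrees with the one-hour length of each test sequence.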

  • Our dataset is split into two parts. 1,127 one-minute video sequences from 103 scenes are used as the training and validation sets. There are 3 labeled frames in each training video, and the interval between two labeled frames is 15 seconds. Some samples are shown below:

  • The test set has 5 one-hour video sequences from 5 different scenes. There are 120 labeled frames in each test scene, and the interval between two labeled frames is 30 seconds. Some samples are shown below:

  • Description.

Reference and Acknowledgments

    Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. "Cross-scene Crowd Counting via Deep Convolutional Neural Networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

    This work is partially supported by NSFC (No. 61025005, 61129001, 61221001, 61301269), STCSM (No. 14XD1402100, 13511504501), 111 Program (No. B07022), Sichuan High Tech R&D Program (No. 2014GZX0009), General Research Fund sponsored by the Research Grants Council of Hong Kong (No. CUHK419412, CUHK417011, CUHK14206114, CUHK14207814), Hong Kong Innovation and Technology Support Programme (No. ITS/221/13FP) and Shenzhen Basic Research Program (No. JCYJ20130402113127496).


    Contact Me

    If you have any questions, please feel free to contact me (

    Last update: June 5, 2015