Crowd3D: Towards Hundreds of People Reconstruction from a Single Image

CVPR 2023

Hao Wen¹†, Jing Huang¹†, Huili Cui¹, Haozhe Lin², Yu-Kun Lai³, Lu Fang², Kun Li¹*

¹ Tianjin University ² Tsinghua University ³ Cardiff University

† Equal contribution * Corresponding author

[Paper] [Supplemental] [Code]

Abstract

Image-based multi-person reconstruction in wide-field large scenes is critical for crowd analysis and security alerting. However, existing methods cannot handle large scenes containing hundreds of people, which pose the challenges of a large number of people, large variations in human scale, and complex spatial distribution. In this paper, we propose Crowd3D, the first framework to reconstruct the 3D poses, shapes and locations of hundreds of people with global consistency from a single large-scene image. The core of our approach is to convert the problem of complex crowd localization into pixel localization with the help of our newly defined concept, the Human-scene Virtual Interaction Point (HVIP). To reconstruct the crowd with global consistency, we propose a progressive reconstruction network based on HVIP by pre-estimating a scene-level camera and a ground plane. To deal with a large number of persons and various human sizes, we also design an adaptive human-centric cropping scheme. In addition, we contribute a benchmark dataset, LargeCrowd, for crowd reconstruction in a large scene. Experimental results demonstrate the effectiveness of the proposed method.

Method

Overview of the Crowd3D framework.

3D spatial localization by Human-scene Virtual Interaction Point (HVIP).
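The idea of turning 3D localization into pixel localization can be illustrated with a small geometric sketch. It is not the Crowd3D network. It only assumes that, once a scene-level pinhole camera and a ground plane are known, a predicted pixel whose 3D counterpart lies on the ground plane can be lifted to 3D by ray-plane intersection. The function name `backproject_to_ground`, the intrinsics `fx, fy, cx, cy`, the plane parameters `normal` and `offset`, and the axis convention (x right, y down, z forward) are all hypothetical choices made for this illustration.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def backproject_to_ground(
    u: float, v: float,
    fx: float, fy: float, cx: float, cy: float,
    normal: Vec3, offset: float,
) -> Vec3:
    """Intersect the camera ray through pixel (u, v) with a ground plane.

    The plane is given in camera coordinates as normal . X + offset = 0.
    Assumed (hypothetical) convention: x right, y down, z forward.
    """
    # Direction of the ray through the pixel for a pinhole camera.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)

    # Solve normal . (t * ray) + offset = 0 for the ray parameter t.
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < 1e-12:
        raise ValueError("ray is parallel to the ground plane")
    t = -offset / denom
    if t <= 0:
        raise ValueError("intersection lies behind the camera")

    return (t * ray[0], t * ray[1], t * ray[2])


# Illustrative numbers only: a camera 2 m above flat ground, looking
# horizontally, so the ground is the plane y = 2 (normal (0, 1, 0), offset -2).
point = backproject_to_ground(
    u=960.0, v=640.0,
    fx=1000.0, fy=1000.0, cx=960.0, cy=540.0,
    normal=(0.0, 1.0, 0.0), offset=-2.0,
)
print(point)
```

With these illustrative numbers the pixel sits 100 rows below the principal point, so the ray meets the ground about 20 m in front of the camera at roughly (0, 2, 20). Under this assumption a single 2D point per person is enough to fix that person's position in a shared scene frame.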

Results on LargeCrowd

In the following comparison, the two baselines carry a “-large” suffix because they are modified from the corresponding state-of-the-art methods; at the time of our submission, no existing method could directly handle large-scene images with hundreds of people.

Test on PANDA

PANDA* is the first public gigapixel-level human-centric video dataset, but it provides no human pose or position labels. The camera parameters in PANDA are unknown and differ from those in LargeCrowd.

* Wang et al. PANDA: A gigapixel-level human-centric video dataset. CVPR, 2020.

Proposed Dataset: LargeCrowd

We also construct LargeCrowd, a benchmark dataset with over 100K labeled humans (2D bounding boxes, 2D keypoints, 3D ground plane and HVIPs) in 733 gigapixel images (19200×6480) of 9 different scenes.

Technical Paper

Citation

Hao Wen, Jing Huang, Huili Cui, Haozhe Lin, Yu-Kun Lai, Lu Fang, Kun Li. "Crowd3D: Towards Hundreds of People Reconstruction from a Single Image". In Proc. CVPR, 2023, pp. 8937-8946.

@inproceedings{Crowd3D,
  author = {Wen, Hao and Huang, Jing and Cui, Huili and Lin, Haozhe and Lai, Yu-Kun and Fang, Lu and Li, Kun},
  title = {Crowd3D: Towards Hundreds of People Reconstruction from a Single Image},
  booktitle = {CVPR},
  year = {2023},
  pages = {8937--8946}
}