1 Tianjin University 2 Cardiff University
* Corresponding author
Recovering scene-consistent 4D crowd motion from monocular video in large-scale scenes remains challenging due to severe depth ambiguity and complex scene geometry. Existing monocular crowd reconstruction methods typically rely on single-plane assumptions, leading to unreliable metric scale and spatial drift under complex terrain. We propose Crowd4D, the first scene-aware 4D crowd reconstruction framework that jointly optimizes the crowd and scene from a monocular RGB video in large-scale scenes. Crowd4D explicitly incorporates scene geometry and ensures consistency across image and scene spaces via a multi-stage optimization strategy. We introduce the Human-Scene Interaction Proxy (HSIP), derived from Scene Interaction Point Clouds and a Scene Interaction Surface (SIPC&SIS), to encode scene-aware geometric priors and redefine the optimization space. We further introduce Crowd Structural Coherence Regularization (CSCR), which uses HSIP-based spatial priors to impose soft temporal consistency on pairwise relative displacements and directions within local crowd neighborhoods. Extensive experiments demonstrate that Crowd4D outperforms existing state-of-the-art methods and enables robust monocular 4D crowd reconstruction in complex, large-scale real-world scenes.
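To make the idea behind CSCR more concrete, the following is a minimal illustrative sketch of a soft temporal consistency penalty on pairwise relative displacements and directions within local neighborhoods. It is not the authors' implementation. The function names (`local_pairs`, `coherence_penalty`), the fixed neighborhood radius, the weights `w_disp` and `w_dir`, and the use of raw 3D positions in place of HSIP-based spatial priors are all assumptions made for illustration.

```python
import math


def local_pairs(positions, radius):
    """Return index pairs (i, j), i < j, whose distance is below `radius`.

    Hypothetical neighborhood rule: a fixed metric radius in scene space.
    """
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) < radius:
                pairs.append((i, j))
    return pairs


def coherence_penalty(track, radius=2.0, w_disp=1.0, w_dir=1.0):
    """Soft penalty on frame-to-frame change of pairwise relative offsets.

    track[t][i] is the (x, y, z) position of person i at frame t.
    The penalty is zero when neighboring people move rigidly together
    and grows as their relative offset or its direction changes.
    """
    total = 0.0
    for t in range(len(track) - 1):
        cur, nxt = track[t], track[t + 1]
        for i, j in local_pairs(cur, radius):
            # Relative offset from person i to person j in both frames.
            d0 = [b - a for a, b in zip(cur[i], cur[j])]
            d1 = [b - a for a, b in zip(nxt[i], nxt[j])]

            # Displacement term: squared change of the relative offset.
            disp = sum((u - v) ** 2 for u, v in zip(d0, d1))

            # Direction term: 1 - cosine similarity of the two offsets.
            n0, n1 = math.hypot(*d0), math.hypot(*d1)
            if n0 > 1e-9 and n1 > 1e-9:
                cos = sum(u * v for u, v in zip(d0, d1)) / (n0 * n1)
                direction = 1.0 - cos
            else:
                direction = 0.0  # direction undefined for coincident points

            total += w_disp * disp + w_dir * direction
    return total


# Two people walking side by side keep their relative offset.
rigid = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
         [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]]
# One person drifts away from the other between frames.
drift = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
         [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]]

print(coherence_penalty(rigid), coherence_penalty(drift))
```

In a sketch like this the term is soft: it would be added to the other objectives with a weight rather than enforced as a hard constraint, so neighbors can still change formation when the image evidence supports it.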
Overview of the Crowd4D framework. Our method starts from crowd and scene initialization; however, the initial results exhibit significant deviations in scene scale and individual spatial locations. To address this issue, we introduce human-scene interaction processing, where geometric constraints are imposed via Scene Interaction Point Clouds and a Scene Interaction Surface (SIPC&SIS) to construct Human-Scene Interaction Proxies (HSIP) that guide subsequent optimization. Finally, with crowd structure, geometry, and motion priors, a multi-stage joint optimization is performed to obtain scene-consistent and pixel-consistent 4D crowd reconstruction results.
Qualitative comparison with DyCrowd on VirtualCrowd and PANDA.
Hongbo Kang, Tianyi Zhou, Qingyang Yang, Hongwei Wen, Jing Huang, Yu-Kun Lai and Kun Li. "Crowd4D: Scene-Aware Monocular 4D Crowd Reconstruction". International Conference on Machine Learning, 2026.
@inproceedings{kang2026crowd4d,
author = {Hongbo Kang and Tianyi Zhou and Qingyang Yang and Hongwei Wen and Jing Huang and Yu-Kun Lai and Kun Li},
title = {Crowd4D: Scene-Aware Monocular 4D Crowd Reconstruction},
booktitle = {International Conference on Machine Learning},
year = {2026}
}