Crowd4D Project Page

ICML 2026

Crowd4D: Scene-Aware Monocular 4D Crowd Reconstruction

Hongbo Kang¹, Tianyi Zhou¹, Qingyang Yang¹, Hongwei Wen¹, Jing Huang¹, Yu-Kun Lai², Kun Li¹*

¹ Tianjin University ² Cardiff University

* Corresponding author

[Paper] [arXiv (coming soon)] [Code]

Abstract

Recovering scene-consistent 4D crowd motion from monocular video in large-scale scenes remains challenging due to severe depth ambiguity and complex scene geometry. Existing monocular crowd reconstruction methods typically rely on single-plane assumptions, leading to unreliable metric scale and spatial drift under complex terrain. We propose Crowd4D, the first scene-aware 4D crowd reconstruction framework that jointly optimizes the crowd and scene from a monocular RGB video in large-scale scenes. Crowd4D explicitly incorporates scene geometry and ensures consistency across image and scene spaces via a multi-stage optimization strategy. We introduce the Human-Scene Interaction Proxy (HSIP), derived from Scene Interaction Point Clouds and a Scene Interaction Surface (SIPC&SIS), to encode scene-aware geometric priors and redefine the optimization space. We further introduce Crowd Structural Coherence Regularization (CSCR), which uses HSIP-based spatial priors to impose soft temporal consistency on pairwise relative displacements and directions within local crowd neighborhoods. Extensive experiments demonstrate that Crowd4D outperforms existing state-of-the-art methods and enables robust monocular 4D crowd reconstruction in complex, large-scale real-world scenes.
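To make the idea behind CSCR more concrete, the following is a minimal sketch of a soft temporal-consistency penalty on pairwise relative displacements and directions within local neighborhoods. It is an illustration only, not the formulation used in Crowd4D: the k-nearest-neighbor pairing on the first frame, the squared-difference and cosine terms, the weight `w_dir`, and the function names `knn_pairs` and `coherence_penalty` are all assumptions, and the sketch omits the HSIP-based spatial priors entirely.

```python
import numpy as np


def knn_pairs(positions, k=2):
    """Pair each person with its k nearest neighbours (assumed neighbourhood rule).

    positions: (N, 3) array of per-person locations in one frame, with N > k.
    Returns two index arrays (i, j), each of length N * k.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)          # a person is not its own neighbour
    nbrs = np.argsort(dist, axis=1)[:, :k]  # (N, k) nearest-neighbour indices
    i = np.repeat(np.arange(len(positions)), k)
    return i, nbrs.reshape(-1)


def coherence_penalty(traj, k=2, w_dir=1.0, eps=1e-8):
    """Illustrative soft penalty on frame-to-frame changes in local crowd structure.

    traj: (T, N, 3) array of per-person locations over T frames.
    The penalty is near zero when each neighbourhood moves rigidly
    (same relative offsets in consecutive frames) and grows otherwise.
    """
    i, j = knn_pairs(traj[0], k)            # neighbourhoods fixed from frame 0
    rel = traj[:, j, :] - traj[:, i, :]     # (T, P, 3) pairwise relative offsets

    # Displacement term: relative offsets should change slowly over time.
    disp_term = np.sum((rel[1:] - rel[:-1]) ** 2, axis=-1)

    # Direction term: 1 - cosine similarity between consecutive offsets.
    unit = rel / (np.linalg.norm(rel, axis=-1, keepdims=True) + eps)
    dir_term = 1.0 - np.sum(unit[1:] * unit[:-1], axis=-1)

    return float(np.mean(disp_term + w_dir * dir_term))
```

Because the penalty depends only on relative offsets, a crowd that translates together incurs almost no cost, while individuals that drift away from their neighbors are penalized. In an optimization setting such a term would typically be added, with a small weight, to the other objectives rather than enforced as a hard constraint.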

Method

Overview of the Crowd4D framework. Our method starts from crowd and scene initialization; however, the initial results exhibit significant deviations in scene scale and individual spatial locations. To address this issue, we introduce human-scene interaction processing, where geometric constraints are imposed via Scene Interaction Point Clouds and a Scene Interaction Surface (SIPC&SIS) to construct Human-Scene Interaction Proxies (HSIP) that guide subsequent optimization. Finally, with crowd structure, geometry, and motion priors, a multi-stage joint optimization is performed to obtain scene-consistent and pixel-consistent 4D crowd reconstruction results.
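As a rough illustration of how a surface fitted to scene points can constrain human positions, the sketch below fits a smooth height field to a set of 3D points and snaps foot locations onto it. This page does not specify how SIPC, SIS, or HSIP are actually constructed, so everything here is an assumption made for illustration: the quadratic height field standing in for a Scene Interaction Surface, the z-up convention, and the helper names `fit_height_surface` and `snap_to_surface`.

```python
import numpy as np


def _design(xy):
    """Quadratic design matrix [1, x, y, x^2, xy, y^2] for (M, 2) inputs."""
    x, y = xy[:, 0], xy[:, 1]
    return np.stack([np.ones_like(x), x, y, x * x, x * y, y * y], axis=1)


def fit_height_surface(points):
    """Least-squares fit of z = f(x, y) to (M, 3) scene points.

    Assumption: z is the up axis and the local terrain is well described
    by a quadratic height field. Returns the 6 polynomial coefficients.
    """
    points = np.asarray(points, dtype=float)
    coeffs, *_ = np.linalg.lstsq(_design(points[:, :2]), points[:, 2], rcond=None)
    return coeffs


def snap_to_surface(feet, coeffs):
    """Replace the height of each (x, y, z) foot location with the surface height.

    The returned points play the role of hypothetical surface-anchored
    proxies; x and y are left unchanged.
    """
    snapped = np.asarray(feet, dtype=float).copy()
    snapped[:, 2] = _design(snapped[:, :2]) @ coeffs
    return snapped
```

A typical use would be to call `fit_height_surface` on scene points collected near where people stand, then call `snap_to_surface` on the initial foot estimates to obtain terrain-consistent anchors. Unlike a single-plane assumption, even this simple height field can follow slopes and gentle curvature.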

Results

Qualitative comparison with DyCrowd on VirtualCrowd and PANDA.

Citation

Hongbo Kang, Tianyi Zhou, Qingyang Yang, Hongwei Wen, Jing Huang, Yu-Kun Lai, and Kun Li. "Crowd4D: Scene-Aware Monocular 4D Crowd Reconstruction." International Conference on Machine Learning (ICML), 2026.

@inproceedings{kang2026crowd4d,
  author = {Hongbo Kang and Tianyi Zhou and Qingyang Yang and Hongwei Wen and Jing Huang and Yu-Kun Lai and Kun Li},
  title = {Crowd4D: Scene-Aware Monocular 4D Crowd Reconstruction},
  booktitle = {International Conference on Machine Learning},
  year = {2026}
}