Kun Li

Kun Li 李坤

Professor

College of Intelligence and Computing,

Tianjin University (Peiyang University), Tianjin 300350, China

Email: lik@tju.edu.cn

Our code is available at: https://github.com/3DV-TJU

News:

The slides of the annual progress report on digital humans (talk at China3DV 2024) are available here.

Selected Publications

RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters

ICCV, 2025

Code

Crowd evacuation simulation is critical for enhancing public safety and is in demand for realistic virtual environments. However, existing methods fail to generate reasonable, personalized and real-time evacuation motions. In this paper, aligned with the sensory-decision-motor (SDM) flow of the human brain, we propose a real-time 3D crowd evacuation simulation framework that integrates a 3D-adaptive SFM (Social Force Model) Decision Mechanism and a Personalized Gait Control Motor. This framework allows multiple agents to move in parallel and is suitable for various scenarios, with dynamic crowd awareness. Additionally, we introduce Part-level Force Visualization to assist in evacuation analysis.
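For readers unfamiliar with the Social Force Model mentioned above, the following is a minimal sketch of one update step of a classic 2D social force model. It is not the 3D-adaptive decision mechanism of the paper: the function name `social_force_step` and the constants `tau`, `A` and `B` are illustrative assumptions, and walls, obstacles and gait control are omitted.

```python
import math

def social_force_step(positions, velocities, goal, dt=0.05,
                      desired_speed=1.3, tau=0.5, A=2.0, B=0.3):
    """Advance all agents by one step of a basic 2D social force model.

    positions, velocities: lists of (x, y) tuples, one per agent.
    goal: (x, y) exit location shared by all agents.
    tau, A, B: illustrative constants (assumptions, not values from the paper).
    """
    new_positions, new_velocities = [], []
    for i, ((px, py), (vx, vy)) in enumerate(zip(positions, velocities)):
        # Driving force: relax toward the desired velocity pointing at the goal.
        gx, gy = goal[0] - px, goal[1] - py
        dist_goal = math.hypot(gx, gy) or 1e-9
        fx = (desired_speed * gx / dist_goal - vx) / tau
        fy = (desired_speed * gy / dist_goal - vy) / tau
        # Repulsive force from every other agent, decaying exponentially with distance.
        for j, (qx, qy) in enumerate(positions):
            if j == i:
                continue
            dx, dy = px - qx, py - qy
            d = math.hypot(dx, dy) or 1e-9
            push = A * math.exp(-d / B)
            fx += push * dx / d
            fy += push * dy / d
        # Explicit Euler integration with unit mass.
        vx, vy = vx + fx * dt, vy + fy * dt
        new_velocities.append((vx, vy))
        new_positions.append((px + vx * dt, py + vy * dt))
    return new_positions, new_velocities

# Two agents starting close together walk toward a shared exit for 2 seconds.
pos = [(0.0, 0.0), (0.0, 0.5)]
vel = [(0.0, 0.0), (0.0, 0.0)]
for _ in range(40):
    pos, vel = social_force_step(pos, vel, goal=(10.0, 0.0))
```

All agents are updated from the same snapshot of positions, so the step can be computed for every agent in parallel.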

FOF-X: Towards Real-time Detailed Human Reconstruction from a Single Image

arXiv, 2024.12

Code

We extend FOF by: 1) designing a new reconstruction framework that greatly mitigates the performance degradation caused by texture and lighting effects; 2) proposing a robust mesh-to-FOF conversion algorithm with an automaton-based discontinuity matcher to enable real-time execution and significantly improve the system's robustness when facing challenging poses; and 3) proposing a FOF-to-mesh algorithm with a Laplacian coordinate constraint for greater robustness and fidelity.

EMP3D: An Emergency Medical Procedures 3D Dataset with Pose and Shape

Frontiers of Computer Science, 2025

Dataset

Emergency Medical Services play a critical role in acute emergencies, yet their effectiveness is often limited by professional complexities. Existing datasets for medical movement analysis mainly focus on basic patient actions like lying and standing, and lack 3D poses. We propose EMP3D, a new dataset capturing the intricate movements of rescuers during emergency procedures with 3D poses and shapes. This dataset will help enhance emergency response training and improve rescue skills.

AI System Facilitates People with Blindness and Low Vision in Interpreting and Experiencing Unfamiliar Environments

npj Artificial Intelligence, 2025

Code

Engaging with nature significantly enhances well-being, yet millions of individuals with blindness and low vision (BLV) are often excluded from these benefits due to constrained environmental perception. Here, we introduce VIPTour, an AI-driven system powered by the FocusFormer algorithm, which transforms complex scenes into structured, personalized graphs using tailored attention mechanisms and a BLV-in-the-Loop Adapter.

SpeechAct: Towards Generating Whole-body Motion from Speech

IEEE TVCG, 2025

Code

In this paper, we introduce a novel method, named SpeechAct, based on a hybrid point representation and contrastive motion learning to boost realism and diversity in whole-body motion generation from speech. Our method can be generalized to other languages, and the generated motion can be used to animate reconstructed avatars.

Real-time 3D Human Reconstruction and Rendering System from a Single RGB Camera

SIGGRAPH Asia 2024 Technical Communications

Transforming 2D human images into 3D appearance is essential for immersive communication. In this paper, we introduce a low-cost real-time 3D human reconstruction and rendering system with a single RGB camera at 28+ FPS, which guarantees both real-time computing speed and realistic rendering results. It can be applied to 3D holographic displays and virtual reality environments.

DualAvatar: Robust Gaussian Splatting Avatar with Dual Representation

SIGGRAPH Asia Poster 2024

R²Human: Real-Time 3D Human Appearance Rendering from a Single Image

ISMAR, 2024

Code

Reconstructing 3D human appearance from a single image is crucial for achieving holographic communication and immersive social experiences. However, this remains a challenge for existing methods, which typically rely on multi-camera setups or are limited to offline operations. In this work, we propose R²Human, the first approach for real-time inference and rendering of photorealistic 3D human appearance from a single image.

HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model

ISMAR, 2024

Code

The generation of 3D clothed humans has attracted increasing attention in recent years. However, existing work cannot generate layered high-quality 3D humans with consistent body structures. As a result, these methods are unable to arbitrarily and separately change and edit the body and clothing of the human. In this work, we propose a text-driven layered 3D human generation framework based on a novel physically-decoupled semantic-aware diffusion model.

LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging

CVPR, 2024

Code

Lensless imaging systems can be smaller, structurally simpler and more privacy-preserving, and hence can be adapted to a variety of complex environments. In this paper, we propose what is, to the best of our knowledge, the first end-to-end framework to recover 3D human poses and shapes from lensless measurements.

Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints

CVPR, 2024

Code

3D human generation is increasingly significant in various applications. However, the direct use of 2D generative methods in 3D generation often results in significant loss of local details, while methods that reconstruct geometry from generated images struggle with global view consistency. In this work, we introduce Joint2Human, a novel method that leverages 2D diffusion models to generate detailed 3D human geometry directly, ensuring both global structure and local details.

High-Quality Animatable Dynamic Garment Reconstruction from Monocular Videos

IEEE TCSVT, 2024

Code

Much progress has been made in reconstructing garments from an image or a video. However, none of the existing works meets the expectation of digitizing high-quality animatable dynamic garments that can be adjusted to various unseen poses. In this paper, we propose the first method to recover high-quality animatable dynamic garments from monocular videos without depending on scanned data.

Towards Grouping in Large Scenes with Occlusion-aware Spatio-temporal Transformers

IEEE TCSVT, 2024

Code

Group detection, especially for large-scale scenes, has many potential applications for public safety and smart cities. In this paper, we propose an end-to-end framework, GroupTransformer, for group detection in large-scale scenes.

HDhuman: High-quality Human Novel-view Rendering from Sparse Views

IEEE TVCG, 2024

Code

We aim to address the challenge of novel view rendering of human performers that wear clothes with complex texture patterns using a sparse set of camera views. Our method can render high-quality images at 2k resolution on novel views, and it is a general framework that is able to generalize to novel subjects.

Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning

ICCV, 2023

Code

We propose Narrator for naturally and controllably generating realistic and diverse human-scene interactions from textual descriptions, and further propose a simple yet effective multi-human generation strategy, which is the first exploration for controllable multi-human scene interaction generation.

Crowd3D: Towards Hundreds of People Reconstruction from a Single Image

CVPR, 2023

Code

We propose Crowd3D, the first framework to reconstruct the 3D poses, shapes and locations of hundreds of people with global consistency from a single large-scene image. We also contribute a benchmark dataset, LargeCrowd, for crowd reconstruction in a large scene.

Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing

CVPR, 2023

Code

3D human body representation learning has received increasing attention in recent years. However, existing works cannot flexibly, controllably and accurately represent human bodies, limited by coarse semantics and unsatisfactory representation capability, particularly in the absence of supervised data. In this paper, we propose a human body representation with fine-grained semantics and high reconstruction accuracy in an unsupervised setting.

MILI: Multi-person Inference from a Low-resolution Image

Fundamental Research, 2023

Code

Existing multi-person reconstruction methods require the human bodies in the input image to occupy a considerable portion of the picture. However, low-resolution human objects are ubiquitous due to the trade-off between the field of view and target distance given a limited camera resolution. In this paper, we propose an end-to-end multi-task framework for multi-person inference from a low-resolution image (MILI).

High-Quality Reconstruction of Depth Maps From Graph-Based Non-Uniform Sampling

IEEE TMM, 2023

FOF: Learning Fourier Occupancy Field for Monocular Real-time Human Reconstruction

NeurIPS, 2022

Code

We propose Fourier Occupancy Field (FOF), a novel, powerful, efficient and flexible 3D geometry representation, for monocular real-time and accurate human reconstruction.
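As a rough illustration of the idea behind a Fourier occupancy representation, the sketch below encodes the inside/outside signal along a single line through the body as a truncated Fourier series and decodes it by thresholding. It assumes, beyond what is stated above, that occupancy is expanded along one axis normalized to [-1, 1]; the function names and the number of terms are hypothetical, and the learning part of the method is not shown.

```python
import math

def encode_occupancy(intervals, num_terms=16):
    """Fourier coefficients of a 1D occupancy signal on z in [-1, 1].

    intervals: list of (z_in, z_out) pairs where the line is inside the surface.
    Returns (a, b): a[0] is the constant term, a[n] and b[n] are the
    cosine and sine coefficients of frequency n * pi.
    """
    a = [0.0] * num_terms
    b = [0.0] * num_terms
    for z_in, z_out in intervals:
        # Closed-form integrals of the indicator function of [z_in, z_out].
        a[0] += (z_out - z_in) / 2.0
        for n in range(1, num_terms):
            w = n * math.pi
            a[n] += (math.sin(w * z_out) - math.sin(w * z_in)) / w
            b[n] += (math.cos(w * z_in) - math.cos(w * z_out)) / w
    return a, b

def decode_occupancy(a, b, z):
    """Evaluate the truncated Fourier series at depth z."""
    value = a[0]
    for n in range(1, len(a)):
        w = n * math.pi
        value += a[n] * math.cos(w * z) + b[n] * math.sin(w * z)
    return value

def is_inside(a, b, z, threshold=0.5):
    """A point is treated as inside the surface when the series exceeds the threshold."""
    return decode_occupancy(a, b, z) > threshold

# A line that passes through two body parts, e.g. an arm in front of the torso.
a, b = encode_occupancy([(-0.5, -0.1), (0.2, 0.6)])
inside_front = is_inside(a, b, -0.3)
inside_gap = is_inside(a, b, 0.05)
```

Storing one such coefficient vector per pixel turns a 3D occupancy volume into a multi-channel 2D image, which is the kind of compact layout that image-to-image networks can predict; the truncation order trades detail against ringing near surface crossings.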

MH-HMR: Human Mesh Recovery from Monocular Images via Multi-Hypothesis Learning

CAAI TRIT, 2023

Code

Recovering 3D human meshes from monocular images is an inherently ambiguous and challenging task due to depth ambiguity, joint occlusion and truncation. We propose a novel multi-hypothesis approach, MH-HMR, for human mesh recovery, which can efficiently and adequately learn the feature representation of multiple hypotheses.

Learning to Infer Inner-Body under Clothing from Monocular Video

IEEE TVCG, 2022

Code Dataset

Accurately estimating the human inner-body under clothing is very important for body measurement, virtual try-on and VR/AR applications. In this paper, we propose the first method that allows everyone to easily reconstruct their own 3D inner-body under daily clothing from a self-captured video, with a mean reconstruction error of 0.73 cm within 15 s. This avoids privacy concerns arising from nudity or minimal clothing.

High-Fidelity Human Avatars from a Single RGB Camera

CVPR, 2022

Code

We propose a coarse-to-fine framework to reconstruct a personalized high-fidelity human avatar from a monocular video. Our framework also enables photo-realistic novel view/pose synthesis and shape editing applications.

STATE: Learning Structure and Texture Representations for Novel View Synthesis

CVM, 2022

Code

We propose STATE, an end-to-end deep neural network, for sparse view synthesis by learning STructure And TExture representations. Our method also enables texture and structure editing applications benefitting from implicit disentanglement of structures and textures.

High Quality Rendered Dataset and Non-local Graph Convolutional Network for Intrinsic Image Decomposition

Journal of Image and Graphics (Chinese), 2022

Dataset

We propose an intrinsic decomposition framework and a new photorealistic rendered dataset for intrinsic image decomposition, which is rendered by leveraging large-scale 3D indoor scene models, along with high-quality textures and lighting to simulate the real-world environment. Chromatic shading components are implemented for the first time.

Implicit Transformer Network for Screen Content Image Continuous Super-Resolution

NeurIPS, 2021

Code

We propose a novel Implicit Transformer Super-Resolution Network (ITSRN) for screen content image super-resolution at arbitrary scales. We also construct a benchmark dataset with various screen contents.

Geometry-guided Dense Perspective Network for Speech-Driven Facial Animation

IEEE TVCG, 2021

Code

Realistic speech-driven 3D facial animation is a challenging problem due to the complex relationship between speech and face. In this paper, we propose a deep architecture, called Geometry-guided Dense Perspective Network (GDPnet), to achieve speaker-independent realistic 3D facial animation.

Image-Guided Human Reconstruction via Multi-Scale Graph Transformation Networks

IEEE TIP, 2021

Code Dataset

To reconstruct topology-consistent deformed human models, this paper proposes a novel deep learning framework with cascaded multi-scale graph transformation networks. The D²Human (Dynamic Detailed Human) dataset is also presented and made available.

Deep Social Grouping Network for Large Scenes with Multiple Subjects

SCIENTIA SINICA Informationis (Chinese), 2021

Code

This paper proposes a fine-grained social grouping framework for gigapixel large scene images based on deep learning.

PISE: Person Image Synthesis and Editing with Decoupled GAN

CVPR, 2021

Code

This paper proposes a novel two-stage generative model for Person Image Synthesis and Editing, which is able to generate realistic person images with desired poses, textures, or semantic layouts.

Cross-MPI: Cross-scale Stereo for Image Super-Resolution using Multiplane Images

CVPR, 2021

Code

This paper proposes an end-to-end reference-based super-resolution network composed of a novel plane-aware attention-based MPI mechanism, a multiscale guided upsampling module as well as a super-resolution synthesis and fusion module.

GPS-Net: Graph-based Photometric Stereo Network

NeurIPS, 2020

Code

This paper proposes a Graph-based Photometric Stereo Network, which unifies per-pixel and all-pixel processing to explore both inter-image and intra-image information.

PoNA: Pose-guided Non-local Attention for Human Pose Transfer

IEEE TIP, 2020

Code

This paper proposes a new human pose transfer method using a generative adversarial network (GAN) with simplified cascaded blocks. Furthermore, our generated images can help to alleviate data insufficiency for person re-identification.

Human Pose Transfer by Adaptive Hierarchical Deformation

Computer Graphics Forum, 2020 (PG2020)

Code

This paper proposes an adaptive human pose transfer network with two hierarchical deformation levels. Our model has very few parameters and is fast to converge. Furthermore, our method can be applied to clothing texture transfer.

Learning to Reconstruct and Understand Indoor Scenes from Sparse Views

IEEE TIP, 2020

Code Dataset

This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation for indoor scenes. Our method only needs a small number (e.g., 3-5) of color images from uncalibrated sparse views, which significantly simplifies data acquisition and broadens applicable scenarios. We also make available a new indoor synthetic dataset, containing photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of complex layouts.

4D Association Graph for Realtime Multi-person Motion Capture Using Multiple Video Cameras

CVPR, 2020

Code Dataset

This paper contributes a novel realtime multi-person motion capture algorithm using multiview video inputs.

Full-Body Motion Capture for Multiple Closely Interacting Persons

Graphical Models, 2020

In this paper, we present a fully automatic and fast method to capture the total human performance including body poses, facial expression, hand gestures, and feet orientations for closely interacting multiple persons.

Discern Depth Under Foul Weather: Estimate PM2.5 for Depth Inference

IEEE Transactions on Industrial Informatics, 2020

Code Dataset

We propose an image-based PM2.5 estimation method and a depth estimation method that require only a single captured color image.

Generating 3D Faces using Multi-column Graph Convolutional Networks

Computer Graphics Forum, 2019 (PG2019)

Code

In this work, we introduce multi-column graph convolutional networks (MGCNs), a deep generative model for 3D mesh surfaces that effectively learns a non-linear facial representation. Moreover, with the help of variational inference, our model has excellent generative ability.

CDnet: CNN-Based Cloud Detection for Remote Sensing Imagery

IEEE Transactions on Geoscience and Remote Sensing, 2019

Cloud detection is one of the important tasks in remote sensing image (RSI) preprocessing. In this paper, we utilize the thumbnail (i.e., preview image) of an RSI, which contains the information of the original multispectral or panchromatic imagery, to extract cloud masks efficiently. We also propose a cloud detection neural network (CDnet) with an encoder-decoder structure, a feature pyramid module (FPM), and a boundary refinement (BR) block.

3D Face Representation and Reconstruction with Multi-scale Graph Convolutional Autoencoders

IEEE ICME, 2019

We propose a multi-scale graph convolutional autoencoder for face representation and reconstruction. Our autoencoder uses graph convolution, which is easily trained for the data with graph structures and can be used for other deformable models. Our model can also be used for variational training to generate high quality face shapes.

Global As-Conformal-As-Possible Non-Rigid Registration of Multi-View Scans

IEEE ICME, 2019

We present a novel framework for global non-rigid registration of multi-view scans captured using consumer-level depth cameras. All scans from different viewpoints are allowed to undergo large non-rigid deformations and finally fused into a complete high quality model.

Global 3D Non-Rigid Registration of Deformable Objects Using a Single RGB-D Camera

IEEE TIP, 2019

We present a novel global non-rigid registration method for dynamic 3D objects. Our method allows objects to undergo large non-rigid deformations, and achieves high quality results even with substantial pose change or camera motion between views. In addition, our method does not require a template prior and uses less raw data than tracking based methods since only a sparse set of scans is needed.

Robust Non-Rigid Registration with Reweighted Position and Transformation Sparsity

IEEE TVCG, 2019

Winner in the SHREC 2019 Contest

We propose a robust non-rigid registration method using reweighted sparsities on position and transformation to estimate the deformations between 3-D shapes.

Spatio-Temporal Reconstruction for 3D Motion Recovery

IEEE TCSVT, 2019

We address the challenge of 3D motion recovery by exploiting the spatio-temporal correlations of corrupted 3D skeleton sequences.

Tensor Completion From Structurally-Missing Entries by Low-TT-rankness and Fiber-wise Sparsity

IEEE JSTSP, 2018

Most tensor completion methods assume that missing entries are randomly distributed in incomplete tensors, but this could be violated in practical applications where missing entries are not only randomly but also structurally distributed. To remedy this, we propose a novel tensor completion method equipped with double priors on the latent tensor, named tensor completion from structurally-missing entries by low tensor train (TT) rankness and fiber-wise sparsity.

Shape and Pose Estimation for Closely Interacting Persons Using Multi-view Images

Computer Graphics Forum, 2018 (PG2018)

We propose a fully-automatic markerless motion capture method to simultaneously estimate 3D poses and shapes of closely interacting people from multi-view sequences.

Intrinsic Image Decomposition With Sparse and Non-local Priors

ICME, 2017

Code World’s FIRST 10K Best Paper Award – Platinum

We propose a new intrinsic image decomposition method that decomposes a single RGB-D image into reflectance and shading components.

SPA: Sparse Photorealistic Animation Using a Single RGB-D Camera

IEEE TCSVT, 2017

We propose a marker-less performance capture method using sparse deformation to obtain the geometry and pose of the actor for each time instance in the database.

Video Super-resolution Using an Adaptive Superpixel-guided Auto-Regressive Model

Pattern Recognition, 2016

Code

We propose a video super-resolution method based on an adaptive superpixel-guided auto-regressive (AR) model.

Foreground-Background Separation From Video Clips via Motion-assisted Matrix Restoration

IEEE TCSVT, 2015

We propose a motion-assisted matrix restoration (MAMR) model for foreground-background separation from video clips.

Non-Rigid Structure from Motion via Sparse Representation

IEEE Transactions on Cybernetics, 2015

We propose a new approach for non-rigid structure from motion with occlusion, based on sparse representation.

Graph-based Segmentation for RGB-D Data Using 3-D Geometry Enhanced Superpixels

IEEE Transactions on Cybernetics, 2015

We propose a two-stage segmentation method for RGB-D data: 1) oversegmentation by 3-D geometry enhanced superpixels; and 2) graph-based merging with label cost from superpixels.

Color-Guided Depth Recovery From RGB-D Data Using an Adaptive Autoregressive Model

ECCV, 2012/IEEE TIP, 2014

Code

We propose an adaptive color-guided autoregressive (AR) model for high quality depth recovery from low quality measurements captured by depth cameras.

Temporal-Dense Dynamic 3D Reconstruction with Low Frame Rate Cameras

IEEE JSTSP, 2012

We propose a new method for temporally dense capture and reconstruction of dynamic scenes with low frame rate cameras, which consists of spatio-temporal sampling, spatio-temporal interpolation, and spatio-temporal fusion.

Three-Dimensional Motion Estimation via Matrix Completion

IEEE TSMCB, 2012

We propose a new 3D motion estimation method based on matrix completion.

Markerless Shape and Motion Capture from Multi-view Video Sequences

IEEE TCSVT, 2011

Multi-Camera and Multi-Lighting Dome

We construct a dome to record the geometry, texture and motion of human actors in a dedicated multiple-camera studio with controlled lighting and a chromakey background. The dome is 6 meters in diameter, which provides enough space for actors to perform. Forty PointGrey Flea2 cameras are arranged in a ring on the dome, and 320 LEDs are evenly spaced on the hemisphere of the dome.

Links

College of Intelligence and Computing

Tianjin University (Peiyang University)