Kun Li 李坤



College of Intelligence and Computing,
Tianjin University (Peiyang University), Tianjin 300350, China

Email: lik@tju.edu.cn

Our code is available at: https://github.com/3DV-TJU


The slides (PPT) of the annual progress report on digital humans (talk at China3DV 2024) are available here.

Selected Publications
LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging
CVPR, 2024 Code
Lensless imaging systems can have a smaller size, a simpler structure, and stronger privacy protection, and hence can be adapted to a variety of complex environments. In this paper, we propose what is, to the best of our knowledge, the first end-to-end framework to recover 3D human poses and shapes from lensless measurements.
Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints
CVPR, 2024 Code
3D human generation is increasingly significant in various applications. However, the direct use of 2D generative methods in 3D generation often results in significant loss of local details, while methods that reconstruct geometry from generated images struggle with global view consistency. In this work, we introduce Joint2Human, a novel method that leverages 2D diffusion models to generate detailed 3D human geometry directly, ensuring both global structure and local details.
High-Quality Animatable Dynamic Garment Reconstruction from Monocular Videos
IEEE T-CSVT, 2023 Code
Much progress has been made in reconstructing garments from an image or a video. However, no existing work meets the expectation of digitizing high-quality animatable dynamic garments that can be adjusted to various unseen poses. In this paper, we propose the first method to recover high-quality animatable dynamic garments from monocular videos without depending on scanned data.
Towards Grouping in Large Scenes with Occlusion-aware Spatio-temporal Transformers
IEEE T-CSVT, 2023 Code
Group detection, especially for large-scale scenes, has many potential applications for public safety and smart cities. In this paper, we propose an end-to-end framework, GroupTransformer, for group detection in large-scale scenes.
Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning
ICCV, 2023 Code
We propose Narrator for naturally and controllably generating realistic and diverse human-scene interactions from textual descriptions, and further propose a simple yet effective multi-human generation strategy, which is the first exploration for controllable multi-human scene interaction generation.
HDhuman: High-quality Human Novel-view Rendering from Sparse Views
IEEE TVCG, 2023 Code
We aim to address the challenge of novel view rendering of human performers that wear clothes with complex texture patterns using a sparse set of camera views. Our method can render high-quality images at 2k resolution on novel views, and it is a general framework that is able to generalize to novel subjects.
Crowd3D: Towards Hundreds of People Reconstruction from a Single Image
CVPR, 2023 Code
We propose Crowd3D, the first framework to reconstruct the 3D poses, shapes and locations of hundreds of people with global consistency from a single large-scene image. We also contribute a benchmark dataset, LargeCrowd, for crowd reconstruction in a large scene.
Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing
CVPR, 2023 Code
3D human body representation learning has received increasing attention in recent years. However, existing works cannot flexibly, controllably and accurately represent human bodies, limited by coarse semantics and unsatisfactory representation capability, particularly in the absence of supervised data. In this paper, we propose a human body representation with fine-grained semantics and high reconstruction accuracy in an unsupervised setting.
MILI: Multi-person Inference from a Low-resolution Image
Fundamental Research 2023 Code
Existing multi-person reconstruction methods require the human bodies in the input image to occupy a considerable portion of the picture. However, low-resolution human objects are ubiquitous due to the trade-off between the field of view and target distance given a limited camera resolution. In this paper, we propose an end-to-end multi-task framework for multi-person inference from a low-resolution image (MILI).
High-Quality Reconstruction of Depth Maps From Graph-Based Non-Uniform Sampling
IEEE TMM, 2023
FOF: Learning Fourier Occupancy Field for Monocular Real-time Human Reconstruction
NIPS, 2022 Code
We propose Fourier Occupancy Field (FOF), a novel, powerful, efficient and flexible 3D geometry representation, for monocular real-time and accurate human reconstruction.
MH-HMR: Human Mesh Recovery from Monocular Images via Multi-Hypothesis Learning
CAAI TRIT 2023 Code
Recovering 3D human meshes from monocular images is an inherently ambiguous and challenging task due to depth ambiguity, joint occlusion and truncation. We propose a novel multi-hypothesis approach, MH-HMR, for human mesh recovery, which can efficiently and adequately learn the feature representation of multiple hypotheses.
Learning to Infer Inner-Body under Clothing from Monocular Video
IEEE TVCG, 2022 Code Dataset
Accurately estimating the human inner-body under clothing is very important for body measurement, virtual try-on and VR/AR applications. In this paper, we propose the first method to allow everyone to easily reconstruct their own 3D inner-body under daily clothing from a self-captured video, with a mean reconstruction error of 0.73 cm within 15 s. This avoids privacy concerns arising from nudity or minimal clothing.
High-Fidelity Human Avatars from a Single RGB Camera
CVPR, 2022 Code
We propose a coarse-to-fine framework to reconstruct a personalized high-fidelity human avatar from a monocular video. Our framework also enables photo-realistic novel view/pose synthesis and shape editing applications.
STATE: Learning Structure and Texture Representations for Novel View Synthesis
CVM, 2022 Code
We propose STATE, an end-to-end deep neural network, for sparse view synthesis by learning STructure And TExture representations. Our method also enables texture and structure editing applications benefitting from implicit disentanglement of structures and textures.
High Quality Rendered Dataset and Non-local Graph Convolutional Network for Intrinsic Image Decomposition
Journal of Image and Graphics (Chinese), 2022 Dataset
We propose an intrinsic decomposition framework and a new photorealistic rendered dataset for intrinsic image decomposition, which is rendered by leveraging large-scale 3D indoor scene models, along with high-quality textures and lighting to simulate the real-world environment. Chromatic shading components are implemented for the first time.
Implicit Transformer Network for Screen Content Image Continuous Super-Resolution
NIPS, 2021 Code
We propose a novel Implicit Transformer Super-Resolution Network (ITSRN) for screen content image super-resolution at arbitrary scales. We also construct a benchmark dataset with various screen contents.
Geometry-guided Dense Perspective Network for Speech-Driven Facial Animation
IEEE TVCG, 2021 Code
Realistic speech-driven 3D facial animation is a challenging problem due to the complex relationship between speech and face. In this paper, we propose a deep architecture, called Geometry-guided Dense Perspective Network (GDPnet), to achieve speaker-independent realistic 3D facial animation.
Image-Guided Human Reconstruction via Multi-Scale Graph Transformation Networks
IEEE TIP, 2021 Code Dataset
To reconstruct topology-consistent deformed human models, this paper proposes a novel deep learning framework with cascaded multi-scale graph transformation networks. D2Human (Dynamic Detailed Human) dataset is also presented and provided.
Deep Social Grouping Network for Large Scenes with Multiple Subjects
SCIENTIA SINICA Informationis (Chinese), 2021 Code
This paper proposes a fine-grained social grouping framework for gigapixel large scene images based on deep learning.
PISE: Person Image Synthesis and Editing with Decoupled GAN
CVPR, 2021 Code
This paper proposes a novel two-stage generative model for Person Image Synthesis and Editing, which is able to generate realistic person images with desired poses, textures, or semantic layouts.
Cross-MPI: Cross-scale Stereo for Image Super-Resolution using Multiplane Images
CVPR, 2021 Code
This paper proposes an end-to-end reference-based super-resolution network composed of a novel plane-aware attention-based MPI mechanism, a multiscale guided upsampling module, and a super-resolution synthesis and fusion module.
GPS-Net: Graph-based Photometric Stereo Network
NIPS, 2020 Code
This paper proposes a Graph-based Photometric Stereo Network, which unifies per-pixel and all-pixel processing to explore both inter-image and intra-image information.
PoNA: Pose-guided Non-local Attention for Human Pose Transfer
IEEE TIP, 2020 Code
This paper proposes a new human pose transfer method using a generative adversarial network (GAN) with simplified cascaded blocks. Furthermore, our generated images can help to alleviate data insufficiency for person re-identification.
Human Pose Transfer by Adaptive Hierarchical Deformation
Computer Graphics Forum, 2020 (PG2020) Code
This paper proposes an adaptive human pose transfer network with two hierarchical deformation levels. Our model has very few parameters and is fast to converge. Furthermore, our method can be applied to clothing texture transfer.
Learning to Reconstruct and Understand Indoor Scenes from Sparse Views
IEEE TIP, 2020 Code Dataset
This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation for indoor scenes. Our method only needs a small number of (e.g., 3-5) color images from uncalibrated sparse views, which significantly simplifies data acquisition and broadens applicable scenarios. We also make available a new indoor synthetic dataset, containing photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of complex layouts.
4D Association Graph for Realtime Multi-person Motion Capture Using Multiple Video Cameras
CVPR, 2020 Code Dataset
This paper contributes a novel realtime multi-person motion capture algorithm using multiview video inputs.
Full-Body Motion Capture for Multiple Closely Interacting Persons
Graphical Models, 2020
In this paper, we present a fully automatic and fast method to capture the total human performance including body poses, facial expression, hand gestures, and feet orientations for closely interacting multiple persons.
Discern Depth Under Foul Weather: Estimate PM2.5 for Depth Inference
IEEE Transactions on Industrial Informatics, 2020 Code Dataset
We propose an image-based method for PM2.5 estimation and a depth estimation method by capturing a single color image.
Generating 3D Faces using Multi-column Graph Convolutional Networks
Computer Graphics Forum, 2019 (PG2019) Code
In this work, we introduce multi-column graph convolutional networks (MGCNs), a deep generative model for 3D mesh surfaces that effectively learns a non-linear facial representation. Moreover, with the help of variational inference, our model has excellent generative ability.
CDnet: CNN-Based Cloud Detection for Remote Sensing Imagery
IEEE Transactions on Geoscience and Remote Sensing, 2019
Cloud detection is one of the important tasks for remote sensing image (RSI) preprocessing. In this paper, we utilize the thumbnail (i.e., preview image) of RSI, which contains the information of the original multispectral or panchromatic imagery, to extract cloud masks efficiently. We also propose a cloud detection neural network (CDnet) with an encoder-decoder structure, a feature pyramid module (FPM), and a boundary refinement (BR) block.
3D Face Representation and Reconstruction with Multi-scale Graph Convolutional Autoencoders
We propose a multi-scale graph convolutional autoencoder for face representation and reconstruction. Our autoencoder uses graph convolution, which is easily trained for the data with graph structures and can be used for other deformable models. Our model can also be used for variational training to generate high quality face shapes.
Global As-Conformal-As-Possible Non-Rigid Registration of Multi-View Scans
We present a novel framework for global non-rigid registration of multi-view scans captured using consumer-level depth cameras. All scans from different viewpoints are allowed to undergo large non-rigid deformations and finally fused into a complete high quality model.
Global 3D Non-Rigid Registration of Deformable Objects Using a Single RGB-D Camera
IEEE TIP, 2019
We present a novel global non-rigid registration method for dynamic 3D objects. Our method allows objects to undergo large non-rigid deformations, and achieves high quality results even with substantial pose change or camera motion between views. In addition, our method does not require a template prior and uses less raw data than tracking based methods since only a sparse set of scans is needed.
Robust Non-Rigid Registration with Reweighted Position and Transformation Sparsity
IEEE TVCG, 2019 Won in the SHREC 2019 Contest
We propose a robust non-rigid registration method using reweighted sparsities on position and transformation to estimate the deformations between 3-D shapes.
Spatio-Temporal Reconstruction for 3D Motion Recovery
We address the challenge of 3D motion recovery by exploiting the spatio-temporal correlations of corrupted 3D skeleton sequences.
Tensor Completion From Structurally-Missing Entries by Low-TT-rankness and Fiber-wise Sparsity
JSTSP 2018
Most tensor completion methods assume that missing entries are randomly distributed in incomplete tensors, but this could be violated in practical applications where missing entries are not only randomly but also structurally distributed. To remedy this, we propose a novel tensor completion method equipped with double priors on the latent tensor, named tensor completion from structurally-missing entries by low tensor train (TT) rankness and fiber-wise sparsity.
Shape and Pose Estimation for Closely Interacting Persons Using Multi-view Images
Computer Graphics Forum, 2018 (PG2018)
We propose a fully-automatic markerless motion capture method to simultaneously estimate 3D poses and shapes of closely interacting people from multi-view sequences.
Intrinsic Image Decomposition With Sparse and Non-local Priors
ICME, 2017 Code World’s FIRST 10K Best Paper Award – Platinum
We propose a new intrinsic image decomposition method that decomposes a single RGB-D image into reflectance and shading components.
SPA: Sparse Photorealistic Animation Using a Single RGB-D Camera
We propose a marker-less performance capture method using sparse deformation to obtain the geometry and pose of the actor for each time instance in the database.
Video Super-resolution Using an Adaptive Superpixel-guided Auto-Regressive Model
Pattern Recognition, 2016 Code
We propose a video super-resolution method based on an adaptive superpixel-guided auto-regressive (AR) model.
Foreground-Background Separation From Video Clips via Motion-assisted Matrix Restoration
We propose a motion-assisted matrix restoration (MAMR) model for foreground-background separation from video clips.
Non-Rigid Structure from Motion via Sparse Representation
IEEE Transactions on Cybernetics, 2015
We propose a new approach for non-rigid structure from motion with occlusion, based on sparse representation.
Graph-based Segmentation for RGB-D Data Using 3-D Geometry Enhanced Superpixels
IEEE Transactions on Cybernetics, 2015
We propose a two-stage segmentation method for RGB-D data: 1) oversegmentation by 3-D geometry enhanced superpixels; and 2) graph-based merging with label cost from superpixels.
Color-Guided Depth Recovery From RGB-D Data Using an Adaptive Autoregressive Model
ECCV, 2012/IEEE TIP, 2014 Code
We propose an adaptive color-guided autoregressive (AR) model for high quality depth recovery from low quality measurements captured by depth cameras.
Temporal-Dense Dynamic 3D Reconstruction with Low Frame Rate Cameras
We propose a new method for temporal-densely capturing and reconstructing dynamic scenes with low frame rate cameras, which consists of spatio-temporal sampling, spatio-temporal interpolation, and spatio-temporal fusion.
Three-Dimensional Motion Estimation via Matrix Completion
We propose a new 3D motion estimation method based on matrix completion.
Markerless Shape and Motion Capture from Multi-view Video Sequences
Multi-Camera and Multi-Lighting Dome
We construct a dome to record the geometry, texture and motion of human actors in a dedicated multiple-camera studio with controlled lighting and a chromakey background. The dome is 6 meters in diameter, which provides enough space for actors to perform. 40 PointGrey Flea2 cameras are arranged in a ring on the dome, and 320 LEDs are evenly spaced over its hemisphere.
