TL;DR: We propose Humans and Structure from Motion (HSfM). Our approach
integrates Human Mesh Recovery and Structure from Motion to jointly
estimate 3D human pose and shape, scene point maps, and camera poses
in a metric world coordinate frame.
Our approach places people in the scene and improves camera pose and scene
reconstruction. Here we show a top view of a gym environment before HSfM
optimization (DUSt3R output) and after optimization with the HSfM loss.
Humans and Structure from Motion enables capturing people interacting with
their environment as well as the spatial positioning between individuals.
Here we show a reconstruction of three people building Lego together.
Overview
Humans and Structure from Motion (HSfM) is a novel method that jointly
reconstructs 3D humans, the scene, and cameras from a sparse set of
uncalibrated images. To achieve this, HSfM combines Human Mesh Recovery
(HMR) methods for local human pose estimation with Structure from Motion
(SfM) techniques for scene and camera reconstruction and for localizing
people. Specifically, our approach combines the camera and scene
reconstruction of data-driven SfM methods, such as DUSt3R, with the
bundle-adjustment step from traditional SfM applied to 2D body keypoints,
where a parametric human body model provides 3D human meshes and
constrains human size.
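To make the bundle-adjustment idea concrete, below is a minimal PyTorch sketch of a confidence-weighted reprojection residual on body joints under a pinhole camera model. The function name, shapes, and weighting are illustrative assumptions, not the paper's actual code.

import torch

def keypoint_reprojection_loss(joints_world, R, t, K, joints_2d, conf):
    """Confidence-weighted reprojection residual on body joints.

    joints_world: (P, J, 3) 3D joints of P people in world coordinates.
    R, t, K:      (C, 3, 3), (C, 3), (C, 3, 3) per-camera extrinsics/intrinsics.
    joints_2d:    (C, P, J, 2) detected 2D joints per camera.
    conf:         (C, P, J) detection confidences used as weights.
    """
    # World -> camera coordinates for every camera at once.
    pts_cam = torch.einsum('cij,pkj->cpki', R, joints_world) + t[:, None, None, :]
    # Pinhole projection to pixel coordinates.
    pts_img = torch.einsum('cij,cpkj->cpki', K, pts_cam)
    uv = pts_img[..., :2] / pts_img[..., 2:3].clamp(min=1e-6)
    # Penalize disagreement with the 2D detections, weighted by confidence.
    return (conf * (uv - joints_2d).norm(dim=-1)).mean()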
Step-by-Step Method
Step 1: Input Processing and Feature Extraction
From the sparse input images, we extract 2D human joints using
ViTPose and estimate 3D joint positions and body shape using HMR2.
For scene and camera reconstruction, we use DUSt3R, a state-of-the-art
data-driven SfM method. We assume known re-identification of people
across camera views.
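A minimal sketch of this stage is below. The three model calls are left as injected callables because the real ViTPose, HMR2, and DUSt3R interfaces are not reproduced here; the shapes in the comments are assumptions.

# Hypothetical preprocessing sketch; detect_joints_2d, run_hmr2, and
# run_dust3r stand in for the real ViTPose / HMR2 / DUSt3R interfaces.
def extract_observations(images, detect_joints_2d, run_hmr2, run_dust3r):
    per_image = []
    for img in images:
        joints_2d, conf = detect_joints_2d(img)  # (P, J, 2) pixels, (P, J) scores
        smpl = run_hmr2(img)                     # per-person SMPL pose/shape params
        per_image.append({'joints_2d': joints_2d, 'conf': conf, 'smpl': smpl})
    pointmaps, cameras = run_dust3r(images)      # scene point maps + initial cameras
    return per_image, pointmaps, cameras         # person re-ID across views is given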
Step 2: Resolving Scale Ambiguity
SfM methods reconstruct cameras and scene only up to an unknown global
scale. We address this ambiguity by estimating camera parameters based on
human body size and orientation, effectively using the humans as metric
rulers. After this alignment, people are roughly placed in a world of
consistent scale.
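One simple form such a scale estimate could take is sketched below, under our own assumption (the paper's exact procedure is not reproduced here): treat the metric joint depths from HMR2 as a ruler and take a robust ratio against DUSt3R's up-to-scale depths at the same pixels.

import torch

def estimate_metric_scale(depth_hmr, depth_dust3r):
    """Illustrative heuristic: one scale factor mapping DUSt3R's
    up-to-scale reconstruction to metric units.

    depth_hmr:    (N,) joint depths from HMR2 (metric, camera frame).
    depth_dust3r: (N,) DUSt3R depths sampled at the same 2D joint pixels.
    """
    ratio = depth_hmr / depth_dust3r.clamp(min=1e-6)
    # The median is robust to occluded joints and bad detections.
    return ratio.median()

Applying the factor to the camera translations and point maps together keeps people and scene in one consistent, roughly metric world.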
Step 3: Joint Human and Scene Reconstruction
We adapt the global alignment loss from DUSt3R to jointly estimate
humans and the scene. The HSfM loss consists of three terms: a
keypoint reprojection term performing bundle adjustment on 2D body
joints, a body shape regularizer, and the global alignment loss from
DUSt3R.
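Schematically, the three terms could be combined as below; the weights are placeholder assumptions, and global_alignment_term stands in for DUSt3R's published alignment objective evaluated on the current scene and cameras.

import torch

def hsfm_loss(uv_reproj, joints_2d, conf, betas, betas_init,
              global_alignment_term, w_kp=1.0, w_shape=0.01, w_ga=1.0):
    """Schematic HSfM objective; weights are illustrative, not the paper's.

    uv_reproj / joints_2d / conf: reprojected vs. detected 2D body joints.
    betas / betas_init:           current vs. initial (HMR2) body shape codes.
    global_alignment_term:        DUSt3R global-alignment loss (scalar tensor).
    """
    l_kp = (conf * (uv_reproj - joints_2d).norm(dim=-1)).mean()  # bundle adjustment
    l_shape = (betas - betas_init).square().mean()               # shape regularizer
    return w_kp * l_kp + w_shape * l_shape + w_ga * global_alignment_term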
Evaluation
This joint reasoning not only enables accurate human placement in the
scene; notably, it also improves camera poses and the scene
reconstruction itself. Evaluations on public benchmarks show significant
improvements. Here we show camera rotation accuracy (RRA) and scaled
camera translation accuracy (s-CCA), in percent, at a threshold of
10 degrees / 10 meters on the EgoHumans benchmark.
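For reference, a standard way to compute RRA is sketched below; the benchmark's exact protocol may differ, and s-CCA would be evaluated analogously on scale-aligned camera centers.

import numpy as np
from itertools import combinations

def relative_rotation_accuracy(R_pred, R_gt, thresh_deg=10.0):
    """RRA: percentage of camera pairs whose relative-rotation error
    falls below the threshold (10 degrees here, matching the text).

    R_pred, R_gt: (C, 3, 3) predicted / ground-truth camera rotations.
    """
    errors = []
    for i, j in combinations(range(len(R_gt)), 2):
        rel_pred = R_pred[i] @ R_pred[j].T   # predicted relative rotation
        rel_gt = R_gt[i] @ R_gt[j].T         # ground-truth relative rotation
        # Geodesic angle between the two relative rotations.
        cos = (np.trace(rel_pred @ rel_gt.T) - 1.0) / 2.0
        errors.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return 100.0 * float(np.mean(np.array(errors) < thresh_deg))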
This project is supported in part by DARPA No. HR001123C0021, IARPA DOI/IBC No. 140D0423C0035, NSF CNS-2235013,
ONR MURI N00014-21-1-2801, the Bakar Fellows Program, and BAIR sponsors. The views and conclusions contained herein are those
of the authors and do not represent the official policies or endorsements of these institutions.
We also thank Chung Min Kim for her critical reviews of this paper and Junyi Zhang for his
valuable insights on the method.
Citation
@article{mueller2024hsfm,
  title={Reconstructing People, Places, and Cameras},
  author={Lea M\"uller and Hongsuk Choi and Anthony Zhang and Brent Yi and Jitendra Malik and Angjoo Kanazawa},
  journal={arXiv:2412.17806},
  year={2024},
}