TL;DR: We propose Humans and Structure from Motion (HSfM), an approach that integrates Human Mesh Recovery and Structure from Motion to jointly estimate 3D human pose and shape, scene point maps, and camera poses in a metric world coordinate frame.
Humans and Structure from Motion (HSfM) is a novel method that jointly reconstructs 3D humans, the scene, and cameras from a sparse set of uncalibrated images. To achieve this, HSfM combines Human Mesh Recovery (HMR) methods for local human pose estimation with Structure from Motion (SfM) techniques for scene and camera reconstruction and for localizing people. Specifically, our approach combines camera and scene reconstruction from data-driven SfM methods, such as DUSt3R, with the bundle adjustment step from traditional SfM. The bundle adjustment is applied to 2D keypoints, and a human body model provides the 3D human meshes and constrains human size.
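To make the bundle adjustment idea concrete, the following is a minimal sketch of the kind of multi-view reprojection residual that such an optimization minimizes over 2D keypoints. It is not the HSfM implementation: the pinhole camera model, the world-to-camera convention, and all names (`project`, `reprojection_residuals`, `rot_y`, the synthetic cameras and joints) are assumptions made for illustration, and the body model and metric scale terms are left out.

```python
import numpy as np

def project(points_world, R, t, K):
    """Pinhole projection of (N, 3) world points into one camera, giving (N, 2) pixels.

    Assumes a world-to-camera convention: x_cam = R @ x_world + t.
    """
    cam = points_world @ R.T + t       # world frame -> camera frame
    uv = cam @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def reprojection_residuals(joints_world, keypoints_2d, cameras):
    """Stack the 2D errors between projected 3D joints and detected keypoints over all views."""
    residuals = [
        project(joints_world, R, t, K) - kp
        for (R, t, K), kp in zip(cameras, keypoints_2d)
    ]
    return np.concatenate(residuals).ravel()

def rot_y(angle_rad):
    """Rotation matrix about the y axis (hypothetical helper for the toy setup)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Toy setup: two cameras with shared intrinsics observing 17 synthetic joints.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cameras = [
    (np.eye(3), np.zeros(3), K),
    (rot_y(0.2), np.array([-0.5, 0.0, 0.0]), K),
]
rng = np.random.default_rng(0)
joints = rng.uniform([-0.5, -1.0, 3.0], [0.5, 1.0, 5.0], size=(17, 3))
keypoints = [project(joints, R, t, K) for R, t, K in cameras]

# Residuals vanish at the true joints and grow when the person is misplaced by 10 cm.
exact = reprojection_residuals(joints, keypoints, cameras)
shifted = reprojection_residuals(joints + np.array([0.1, 0.0, 0.0]), keypoints, cameras)
print(np.abs(exact).max(), np.linalg.norm(shifted))
```

In a full pipeline, a nonlinear least-squares solver would minimize such residuals jointly over human, camera, and scene parameters. Here the residual is only evaluated, to show that a misplaced person produces a measurable 2D error in every view.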
This joint reasoning not only enables accurate human placement in the scene but also improves the camera poses and the scene reconstruction itself. Evaluations on public benchmarks show significant improvements. Here we show the camera angle (RRA) and scaled translation (s-CCA) accuracy, in percent, at a threshold of 10 (degrees for RRA, meters for s-CCA) on the EgoHumans benchmark.
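For readers unfamiliar with the rotation metric, here is a small sketch of how a relative rotation accuracy at a 10 degree threshold is commonly computed. It assumes the usual pairwise definition, which is not confirmed by this page: for every camera pair, compare the predicted relative rotation with the ground-truth one and count the pair as correct if the angular error is below the threshold. The function names and the toy rotations are hypothetical.

```python
import numpy as np

def rotation_angle_deg(R):
    """Angle of a rotation matrix in degrees, from its trace."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def relative_rotation_accuracy(R_pred, R_gt, threshold_deg=10.0):
    """Percentage of camera pairs whose relative rotation error is below the threshold."""
    n = len(R_pred)
    hits, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            rel_pred = R_pred[i].T @ R_pred[j]   # predicted relative rotation
            rel_gt = R_gt[i].T @ R_gt[j]         # ground-truth relative rotation
            error = rotation_angle_deg(rel_pred.T @ rel_gt)
            hits += int(error < threshold_deg)
            total += 1
    return 100.0 * hits / total

def rot_y_deg(angle_deg):
    """Rotation about the y axis, angle in degrees (hypothetical helper for the toy data)."""
    a = np.radians(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Toy data: three cameras; only the (0, 1) pair is predicted within 10 degrees.
R_gt = [rot_y_deg(0.0), rot_y_deg(30.0), rot_y_deg(60.0)]
R_pred = [rot_y_deg(0.0), rot_y_deg(32.0), rot_y_deg(85.0)]
rra = relative_rotation_accuracy(R_pred, R_gt, threshold_deg=10.0)
print(f"RRA@10: {rra:.1f}%")
```

Because the metric uses relative rotations between camera pairs, it does not depend on how the predicted and ground-truth world frames are aligned.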
@article{mueller2024hsfm,
title={Reconstructing People, Places, and Cameras},
author={Lea M\"uller and Hongsuk Choi and Anthony Zhang
and Brent Yi and Jitendra Malik and Angjoo Kanazawa},
year={2024},
journal={arXiv:2412.17806},
}