Generative Proxemics: A Prior for 3D Social Interaction from Images

  • 1 MPI for Intelligent Systems, Tübingen 2 UC Berkeley

Abstract

Social interaction is a fundamental aspect of human behavior and communication. The way individuals position themselves in relation to others, also known as proxemics, conveys social cues and affects the dynamics of social interaction. We present a novel approach that learns a 3D proxemics prior of two people in close social interaction. Since collecting a large 3D dataset of interacting people is a challenge, we rely on 2D image collections where social interactions are abundant. We achieve this by reconstructing pseudo-ground truth 3D meshes of interacting people from images with an optimization approach using existing ground-truth contact maps. We then model the proxemics using a novel denoising diffusion model called BUDDI that learns the joint distribution of two people in close social interaction directly in the SMPL-X parameter space. Sampling from our generative proxemics model produces realistic 3D human interactions, which we validate through a user study. Additionally, we introduce a new optimization method that uses the diffusion prior to reconstruct two people in close proximity from a single image without any contact annotation. Our approach recovers more accurate and plausible 3D social interactions from noisy initial estimates and outperforms state-of-the-art methods.

Sampling meshes from BUDDI

BUDDI is a diffusion model that learns the joint distribution of two people in close proximity. It directly generates SMPL-X body model parameters for two people, starting from random noise.
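The generation process above can be sketched as a standard reverse-diffusion loop over the stacked SMPL-X parameters of two people. This is a minimal, hypothetical illustration: `denoise` is a placeholder for BUDDI's actual transformer denoiser, and the parameter dimension and noise schedule are illustrative, not the values used in the paper.

```python
import numpy as np

# Hypothetical stand-in for BUDDI's transformer denoiser: given noisy
# SMPL-X parameters for two people and a timestep, predict a clean sample.
def denoise(x_noisy, t):
    return x_noisy * (1.0 - t)  # placeholder dynamics, not the real network

def sample_pair(param_dim=79, num_people=2, steps=10, seed=0):
    """Reverse diffusion: start from Gaussian noise over the stacked
    SMPL-X parameters (pose, shape, orientation, translation) of two
    people and iteratively denoise toward a clean sample."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_people, param_dim))
    for t in np.linspace(1.0, 0.0, steps):
        x0_hat = denoise(x, t)  # model's estimate of the clean parameters
        sigma = np.sqrt(max(t - 1.0 / steps, 0.0))
        x = x0_hat + sigma * rng.standard_normal(x.shape)  # re-noise for next step
    return x  # SMPL-X parameters for two interacting bodies

params = sample_pair()
print(params.shape)  # (2, 79)
```

In the real model the denoiser is conditioned on the diffusion timestep and trained on the pseudo-ground-truth fits described below; the loop structure is the standard diffusion sampling scheme.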

Optimization with generative proxemics

We show how BUDDI can be used as a prior during optimization via an SDS loss inspired by DreamFusion. This approach does not require ground-truth contact annotations.

Comparison 'Optimization with BUDDI' vs. BEV


FlickrCI3D Pseudo-ground truth fits

To train BUDDI, we create SMPL-X meshes of people in interaction. To do this, we fit SMPL-X meshes to FlickrCI3D Signatures using the provided ground-truth contact annotations. Here you can see examples of the FlickrCI3D PGT.

How does BUDDI work?

Given a training image collection of people in close proximity (FlickrCI3D Signatures), we create pseudo-ground truth SMPL-X fits via optimization using ground-truth contact annotations. SMPL-X is a human body model that maps pose, shape, global orientation and translation (among others) to a 3D mesh. Given the FlickrCI3D PGT, we propose a diffusion-based approach that learns a 3D generative model of two people in close social interaction. BUDDI directly operates on SMPL-X parameters through a transformer backbone. The model can be used to generate unconditional samples or as a social prior in the downstream optimization task of reconstructing 3D people in close proximity from images, without any extra annotation such as contact maps.
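To make "operates on SMPL-X parameters through a transformer backbone" concrete, here is a sketch of how each person's flat parameter vector could be split into per-group tokens for a transformer to attend over. The group names and sizes follow the public SMPL-X conventions but are illustrative; BUDDI's exact tokenization is an assumption here.

```python
import numpy as np

# Illustrative SMPL-X parameter layout for one person; sizes follow the
# standard SMPL-X conventions, not necessarily BUDDI's exact inputs.
PARAM_GROUPS = {
    "global_orient": 3,   # axis-angle root orientation
    "body_pose": 63,      # 21 body joints x 3 (axis-angle)
    "betas": 10,          # shape coefficients
    "transl": 3,          # root translation
}

def to_tokens(person_params):
    """Split one person's flat parameter vector into named per-group
    tokens, the kind of layout a transformer backbone can attend over."""
    tokens, offset = [], 0
    for name, size in PARAM_GROUPS.items():
        tokens.append((name, person_params[offset:offset + size]))
        offset += size
    return tokens

dim = sum(PARAM_GROUPS.values())       # 79 parameters per person
pair = np.zeros((2, dim))              # two interacting people
tokens = [to_tokens(p) for p in pair]
print(len(tokens), len(tokens[0]))     # 2 people, 4 token groups each
```

Attending jointly over both people's tokens is what lets the model capture inter-person structure such as contact and relative placement.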

To use BUDDI as a prior in optimization, we adopt the SDS loss presented in previous work like DreamFusion. When fitting SMPL-X to image keypoints, in each optimization iteration, we diffuse the current estimate and let BUDDI propose a refined estimate. The refined estimate is closer to the true distribution of interacting people and serves as a prior via an L2 loss. This enables us to fit 3D meshes to images of closely interacting people without relying on ground-truth contact annotations at test time.
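The diffuse-then-refine step above can be sketched as follows. Everything here is a hypothetical illustration: `buddi_refine` stands in for the real network, and the noise level, schedule, and prior weight are made-up values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def buddi_refine(x_noisy, t):
    """Hypothetical stand-in for BUDDI proposing a refined (denoised)
    estimate of the two-person SMPL-X parameters."""
    return 0.9 * x_noisy  # placeholder for the real network

def sds_prior_step(params, t=0.3, weight=0.1):
    """One SDS-style prior step: diffuse the current estimate, let the
    model propose a refined version, then pull the estimate toward it
    via the gradient of an L2 penalty, 0.5 * w * ||x - refined||^2."""
    noise = rng.standard_normal(params.shape)
    noisy = np.sqrt(1 - t) * params + np.sqrt(t) * noise  # forward diffusion
    refined = buddi_refine(noisy, t)                      # prior's proposal
    grad = weight * (params - refined)                    # L2 prior gradient
    return params - grad                                  # gradient step

x = rng.standard_normal((2, 79))      # current two-person SMPL-X estimate
x_new = sds_prior_step(x)
print(x_new.shape)  # (2, 79)
```

In the full method this prior gradient is combined with the keypoint reprojection loss, so the image evidence and the learned interaction prior jointly shape the final fit.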

BibTeX