Photo2Avatar

Reconstructing animatable 3D human avatars from minimal visual input is a challenging task in digital human modeling and virtual content creation. Existing methods predominantly rely on multi-view observations or monocular videos, limiting their applicability in image-sparse scenarios. We introduce Photo2Avatar, a unified pipeline that reconstructs a 3D human avatar from a single image while supporting SMPL-X-driven animation. The method first synthesizes a dense set of multi-view images guided by parametric body priors, enabling spatially consistent supervision from a monocular portrait. A face-aware consistency enhancement module is then applied to improve identity preservation and cross-view coherence, particularly in facial regions. These refined views are used to supervise an animatable avatar learner under a differentiable rendering objective, allowing motion-conditioned geometry and appearance learning with minimal input. Extensive experiments demonstrate that Photo2Avatar achieves superior identity consistency, visual quality, and animation controllability compared to existing single-image baselines. The proposed method offers a practical solution for one-shot digital human reconstruction and bridges the gap between static image perception and dynamic 3D avatar animation.

Photo2Avatar: Single-Image to Animatable 3D Human Avatar via Multi-view Synthesis and Face-Aware Consistency Enhancement

ICVRV 2025

Abstract

Framework