WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

@ ICCV 2025
Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang
Yonsei University




Our approach utilizes the view guidance prior through a warping algorithm on diffusion models to generate view-consistent images from a single scene-level image.
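The warping prior can be illustrated with a minimal depth-based forward warp. This is a sketch, not the paper's exact algorithm: the pinhole intrinsics `K`, the relative pose `(R, t)`, and the nearest-pixel scattering (no occlusion handling) are assumptions made for illustration.

```python
import numpy as np

def warp_to_novel_view(image, depth, K, R, t):
    """Forward-warp `image` into a novel view given per-pixel `depth`,
    pinhole intrinsics K, and a source-to-target pose (R, t).
    Returns the warped image and the warped-region validity mask."""
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates (3 x H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # Unproject to 3D camera space, then move into the target camera frame.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = R @ pts + t.reshape(3, 1)
    # Reproject into the target image plane (nearest-pixel splatting;
    # occlusion/z-buffer handling is omitted for brevity).
    proj = K @ pts
    z = proj[2]
    x = np.round(proj[0] / z).astype(int)
    y = np.round(proj[1] / z).astype(int)
    valid = (z > 0) & (x >= 0) & (x < W) & (y >= 0) & (y < H)
    warped = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)  # warped-region mask
    src_y, src_x = v.reshape(-1)[valid], u.reshape(-1)[valid]
    warped[y[valid], x[valid]] = image[src_y, src_x]
    mask[y[valid], x[valid]] = True
    return warped, mask
```

The resulting mask marks which target pixels receive guidance from the source view; regions it misses are what the diffusion model must synthesize.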






Our method is compared with MegaScenes, a diffusion-based model, and VistaDream, a multi-step pipeline that relies on 3D rendering. Since VistaDream uses more frames than the diffusion-based models, its results tend to be smoother. Although our method does not rely on rendering techniques, it still produces high-quality, view-consistent images.




Abstract


Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address this issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method that uses diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through a comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broad applicability.




Video




Method


Illustration of WAVE. Given an input view, a depth map, and continuous camera poses, our method generates scene images with smooth view transitions through three processes: (a) adaptive warp-range selection uses warped-region masks to estimate the relevance between viewpoints and compute the reference range for attention; (b) pose-aware noise initialization (PANI) re-initializes the diffusion initial noise by combining warped images with the initial noise in the frequency domain; (c) warp-guided adaptive attention (WGAA) uses the warped-region mask and the reference range from the adaptive warp-range selection to perform masked batch self-attention.
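The idea behind pose-aware noise initialization, blending a warped image's structure into the sampler's starting noise via the frequency domain, can be sketched as follows. This is a rough illustration only: the `low_freq_ratio` cutoff, the hard low-pass mask, and the final re-standardization are assumptions, not the paper's exact formulation.

```python
import numpy as np

def pose_aware_noise_init(warped_latent, noise, low_freq_ratio=0.25):
    """Sketch of PANI-style noise re-initialization: keep the low-frequency
    band of the warped image's spectrum and the high-frequency band of
    fresh Gaussian noise, then invert the FFT."""
    H, W = warped_latent.shape
    fw = np.fft.fftshift(np.fft.fft2(warped_latent))
    fn = np.fft.fftshift(np.fft.fft2(noise))
    # Centered low-pass mask covering `low_freq_ratio` of each axis.
    mask = np.zeros((H, W), dtype=bool)
    cy, cx = H // 2, W // 2
    ry = max(1, int(H * low_freq_ratio / 2))
    rx = max(1, int(W * low_freq_ratio / 2))
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = True
    # Low frequencies from the warp, high frequencies from the noise.
    mixed = np.where(mask, fw, fn)
    init = np.fft.ifft2(np.fft.ifftshift(mixed)).real
    # Re-standardize so the sampler still sees roughly unit-variance noise.
    return (init - init.mean()) / (init.std() + 1e-8)
```

The low-frequency band carries the coarse scene layout from the warped view, while the retained noise keeps enough stochasticity for the diffusion model to inpaint unseen regions.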


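The masked batch self-attention in warp-guided adaptive attention can be sketched similarly. The shapes, the per-view `ref_range` lists, and single-head unscaled projections are illustrative assumptions; the intent is only to show each view attending to its own tokens plus the warped-region tokens of its reference views.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_batch_self_attention(q, k, v, region_mask, ref_range):
    """Sketch of WGAA-style attention. q, k, v: (B, N, d) per-view tokens;
    region_mask: (B, N) bool warped-region masks; ref_range[b]: indices of
    the reference views that view b may attend to."""
    B, N, d = q.shape
    out = np.empty_like(q)
    for b in range(B):
        keys, vals = [k[b]], [v[b]]          # always attend to self
        for r in ref_range[b]:
            sel = region_mask[r]             # warped-region tokens of view r
            keys.append(k[r][sel])
            vals.append(v[r][sel])
        K = np.concatenate(keys, axis=0)
        V = np.concatenate(vals, axis=0)
        attn = softmax(q[b] @ K.T / np.sqrt(d), axis=-1)
        out[b] = attn @ V
    return out
```

With empty reference ranges this reduces to ordinary per-view self-attention; widening `ref_range` trades computation for stronger cross-view coupling.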

Different Pose NVS with WAVE



Applying our proposed method, we generate results across multiple camera poses, demonstrating its effectiveness across different viewpoints. To further illustrate this capability, we also generate along a second pose trajectory, providing diverse examples of novel view synthesis.



Consistency & Camera Accuracy Metrics




We conduct experiments on three datasets, MegaScenes, DTU, and RE10K, and evaluate consistency with two metrics: the next metric, which assesses consistency between neighboring viewpoints, and the first metric, which measures consistency with the input view. Additionally, to evaluate camera accuracy, a crucial aspect of novel view synthesis, we report the accuracy of the estimated camera extrinsics using the Frobenius norm, rotation angle difference, and angular consistency.
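The two rotation-error measures have standard closed forms; a minimal sketch (the full evaluation protocol, e.g. how poses are estimated and trajectories aligned, is not reproduced here):

```python
import numpy as np

def rotation_angle_difference(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices:
    arccos((trace(R_pred^T R_gt) - 1) / 2)."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip guards against floating-point values slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def frobenius_error(R_pred, R_gt):
    """Frobenius norm of the difference between the two rotations."""
    return np.linalg.norm(R_pred - R_gt, ord='fro')
```

For example, the identity versus a 90-degree rotation about the z-axis yields an angle difference of 90 degrees and a Frobenius error of 2.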



Overall Comparison of Methods




Circle size represents inference time. Applying our method improves performance without a significant increase in computational cost.