CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

1Sun Yat-Sen University 2Shanghai AI Laboratory 3SenseTime



Abstract

Satellite-to-street view synthesis aims at generating a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibited remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates these controls via an enhanced cross-view attention module. To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms current state-of-the-art methods on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes.

Method

Framework of CrossViewDiff

Overview of CrossViewDiff. First, we use depth estimation to construct 3D voxels that serve as intermediaries for transferring information across viewpoints. Next, we build structural and textural control constraints from the satellite image and the 3D voxels. Finally, an enhanced cross-attention mechanism integrates these cross-view controls to guide the denoising process and produce the synthesized street-view image.
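To make the control-injection step concrete, below is a minimal PyTorch sketch of how noisy street-view latent tokens could attend to concatenated structural and textural control tokens through cross-attention. The module name, tensor shapes, and the simple residual injection are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Hypothetical cross-view attention block: street-view denoiser latents
    attend to structural + textural control tokens (shapes are assumptions)."""
    def __init__(self, dim: int, ctrl_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=ctrl_dim,
                                          vdim=ctrl_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, control):
        # x:       (B, N, dim)      noisy street-view latent tokens
        # control: (B, M, ctrl_dim) structural + textural control tokens
        attended, _ = self.attn(self.norm(x), control, control)
        return x + attended  # residual injection of cross-view controls

if __name__ == "__main__":
    B, N, dim, ctrl_dim = 2, 256, 320, 256
    x = torch.randn(B, N, dim)                      # denoiser latents
    structural = torch.randn(B, 256, ctrl_dim)      # e.g. voxel-projected structure
    textural = torch.randn(B, 256, ctrl_dim)        # e.g. mapped satellite texture
    control = torch.cat([structural, textural], dim=1)
    out = CrossViewAttention(dim, ctrl_dim)(x, control)
    print(out.shape)  # torch.Size([2, 256, 320])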

Results

Qualitative Comparison

Qualitative comparison of synthesis results on CVUSA, CVACT and OmniCity

Qualitative Ablation

Qualitative ablation results on CVUSA and OmniCity

Evaluation

Quantitative Evaluation

We present a quantitative comparison of different methods on the CVUSA, CVACT and OmniCity datasets in terms of various metrics. Compared to the state-of-the-art method for cross-view synthesis (Sat2Density), our method achieved significant improvements in SSIM and FID of 9.44% and 42.70% on CVUSA, respectively. Similarly, improvements of 6.46% and 10.94% in SSIM and FID were observed on CVACT. On OmniCity, our method improved SSIM and FID by 11.71% and 52.21%, respectively.
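For reference, the reported similarity and realism metrics can be computed with common off-the-shelf tools. The sketch below uses scikit-image for SSIM and clean-fid for FID; the directory layout and file naming are placeholder assumptions and this is not the authors' evaluation code.

import numpy as np
from pathlib import Path
from PIL import Image
from skimage.metrics import structural_similarity as ssim
from cleanfid import fid

def mean_ssim(pred_dir: str, gt_dir: str) -> float:
    """Average SSIM over paired synthesized / ground-truth street-view panoramas."""
    scores = []
    for pred_path in sorted(Path(pred_dir).glob("*.png")):
        gt_path = Path(gt_dir) / pred_path.name       # assumes matching filenames
        pred = np.asarray(Image.open(pred_path).convert("RGB"))
        gt = np.asarray(Image.open(gt_path).convert("RGB"))
        scores.append(ssim(gt, pred, channel_axis=2, data_range=255))
    return float(np.mean(scores))

# Placeholder paths for illustration only.
print("SSIM:", mean_ssim("results/cvusa/pred", "results/cvusa/gt"))
print("FID :", fid.compute_fid("results/cvusa/pred", "results/cvusa/gt"))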

Quantitative comparison of different methods on the CVUSA, CVACT and OmniCity datasets in terms of various evaluation metrics

GPT-based Evaluation

Beyond conventional similarity and realism metrics, we also leverage the powerful visual-linguistic capabilities of existing multimodal large language models (MLLMs) to design CrossScore for evaluating the synthesized images.

The overall process for automated evaluation using GPT-4o

Our method significantly outperforms other GAN-based and diffusion-based generation methods along the three evaluation dimensions of Consistency, Visual Realism, and Perceptual Quality. This also indicates that the street-view images synthesized by our method are better aligned with human preferences, which benefits downstream applications such as virtual scene construction.
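As an illustration of this automated scoring pipeline, the sketch below queries GPT-4o through the OpenAI API with a synthesized/ground-truth image pair and asks for scores along the three dimensions above. The prompt wording, score scale, and file paths are assumptions and do not reproduce the exact CrossScore protocol.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def cross_score(synth_path: str, gt_path: str) -> str:
    # Illustrative prompt: ask for scores on the three evaluation dimensions.
    prompt = (
        "You are shown a synthesized street-view panorama followed by the real "
        "panorama of the same location. Rate the synthesized image from 1 to 10 on: "
        "(1) Consistency with the real scene layout, (2) Visual Realism, and "
        "(3) Perceptual Quality. Reply as JSON with the three scores."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(synth_path)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(gt_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Placeholder paths for illustration only.
print(cross_score("pred/0001.png", "gt/0001.png"))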

GPT-based evaluation metrics for Cross-View Synthesis

BibTeX

BibTex Code