SkyDiffusion

Ground-to-Aerial Image Synthesis with Diffusion Models and BEV Paradigm


Sun Yat-Sen University · Shanghai AI Laboratory · SenseTime Research

Task Description
Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street view images while maintaining consistent content layout, simulating a top-down view.

Challenge
The significant viewpoint difference leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging.

Method
We introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm.

New Dataset
We introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster scene aerial synthesis, historical high-resolution satellite image synthesis, and low-altitude UAV image synthesis tasks.

The video shows a pull-up from the ground street view to an aerial, top-down perspective. The footage is rendered with Google Earth and MatrixCity.

Ground-to-Aerial Image Synthesis Results

SkyDiffusion outperforms state-of-the-art methods on cross-view datasets spanning natural (CVUSA), suburban (CVACT), and urban (VIGOR-Chicago) scenes, as well as various application scenarios (G2A-3), generating realistic, content-consistent aerial images.


Ground2Aerial-3 Dataset

We propose Ground2Aerial-3 (G2A-3), a multi-task cross-view synthesis dataset designed to probe how cross-view synthesis methods perform in several novel scenarios. As shown in Figure 3, G2A-3 contains nearly 10k street-view/aerial image pairs, covering disaster scene aerial image synthesis, historical high-resolution satellite image synthesis, and low-altitude UAV image synthesis. The data for each task is randomly split into training and test sets at a 5:1 ratio. Ground street-view images are 1024×512 panoramas with true north aligned at the horizontal center; aerial images are 512×512 and centered on the location of the corresponding ground image.
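For concreteness, a minimal sketch of such a 5:1 split; the directory layout, file names, and seed are illustrative assumptions, not the dataset's actual structure:

```python
import random
from pathlib import Path

# Hypothetical layout: paired files share a stem, e.g.
#   g2a3/disaster/street/0001.png <-> g2a3/disaster/aerial/0001.png
ROOT = Path("g2a3/disaster")

pairs = sorted(p.stem for p in (ROOT / "street").glob("*.png"))
random.seed(0)                      # fixed seed for a reproducible split
random.shuffle(pairs)

n_test = len(pairs) // 6            # 5:1 train/test ratio
test_ids, train_ids = pairs[:n_test], pairs[n_test:]
print(f"train: {len(train_ids)}  test: {len(test_ids)}")
```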

For the low-altitude UAV image synthesis task, we used the virtual MatrixCity environment and the Unreal Engine (UE) to place six single-view cameras at ground level: four horizontal views, one upward view, and one downward view. The six rendered images are stitched into a panorama using the py360convert library. At the same xy position but at an altitude of 20 m, a simulated UAV with a downward-facing camera captures the paired aerial image, forming the cross-view dataset.
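For reference, a minimal sketch of the cubemap-to-panorama stitching with py360convert; the face size, the camera-to-face assignment, and the placeholder arrays are assumptions, while the 1024×512 output matches the street-view resolution above:

```python
import numpy as np
import py360convert

# Six square ground-level renders from the UE cameras (placeholders here):
# front/right/back/left/up/down, all the same size.
face = 512
faces = {k: np.zeros((face, face, 3), dtype=np.uint8)
         for k in ("F", "R", "B", "L", "U", "D")}

# Stitch the cubemap into a 1024x512 equirectangular panorama.
pano = py360convert.c2e(faces, h=512, w=1024, cube_format="dict")
```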

The images below show sample data from each task.

Method

Overview of the proposed SkyDiffusion framework, including the curved BEV transformation and the BEV-controlled diffusion model with light manipulation. The lower panels show the results of the One-to-One and Multi-to-One BEV transformations, respectively.
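To make the geometry concrete, below is a minimal sketch of a plain flat-ground BEV projection from an equirectangular panorama; the camera height, ground coverage, and all names are assumptions. SkyDiffusion's Curved-BEV replaces the flat plane with a curved surface to widen the visible range, which this sketch does not reproduce:

```python
import numpy as np

def flat_bev(pano: np.ndarray, cam_h: float = 2.0,
             bev_size: int = 512, meters: float = 50.0) -> np.ndarray:
    """Project an equirectangular panorama (H, W, 3) onto the ground plane.

    Assumes the camera faces true north at the panorama's horizontal
    center (as in G2A-3) and stands cam_h meters above flat ground.
    """
    H, W = pano.shape[:2]
    # Ground coordinates (meters) for every BEV pixel, camera at center.
    xs = np.linspace(-meters, meters, bev_size)
    gx, gy = np.meshgrid(xs, -xs)           # x: east, y: north (north at top)
    dist = np.hypot(gx, gy)
    azimuth = np.arctan2(gx, gy)            # 0 = north, east positive
    elevation = np.arctan2(-cam_h, dist)    # below the horizon, so negative
    # Equirectangular lookup: azimuth -> column, elevation -> row.
    u = ((azimuth / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = ((0.5 - elevation / np.pi) * H).astype(int).clip(0, H - 1)
    return pano[v, u]
```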

Evaluation

Quantitative Evaluation on Existing Datasets

On the natural CVUSA and suburban CVACT datasets, SkyDiffusion achieves outstanding results: compared to state-of-the-art methods, it reduces FID by 25.72% and improves SSIM by 7.68%, demonstrating its superiority in synthesizing realistic, content-consistent satellite images. On the urban VIGOR-Chicago dataset, SkyDiffusion reduces FID by 14.9% and improves SSIM by 9.41% over the state-of-the-art method.
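For reference, this is how FID and SSIM are typically computed; a minimal sketch with torchmetrics using placeholder batches, not the paper's exact evaluation protocol:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image import StructuralSimilarityIndexMeasure

fid = FrechetInceptionDistance(feature=2048)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)

# Placeholder batches: real and synthesized aerial images, (N, 3, 512, 512).
real = torch.rand(8, 3, 512, 512)
fake = torch.rand(8, 3, 512, 512)

# FID expects uint8 images by default; SSIM compares float batches directly.
fid.update((real * 255).to(torch.uint8), real=True)
fid.update((fake * 255).to(torch.uint8), real=False)
print("FID:", fid.compute().item())
print("SSIM:", ssim(fake, real).item())
```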


The G2A-3 tasks are challenging, yet our method achieves significant improvements over ControlNet, a commonly used image-conditioned synthesis method: SkyDiffusion reduces FID by an average of 19.60% and increases SSIM by an average of 9.90%.


We also conducted ablation experiments on these datasets. Compared to directly using street-view images as input, the Curved-BEV transformation improves performance across multiple metrics by mapping street-view images into the satellite view for domain alignment, which helps synthesize more content-consistent satellite images. Furthermore, the Multi-to-One mapping improves the metrics over One-to-One, demonstrating its effectiveness in dense urban scenes.

We also applied an optional Light Manipulation module to align the lighting conditions of the synthesized image with those of the target-domain image. The results show that Light Manipulation effectively improves the SSIM and PSNR metrics: the module preserves the original image content while allowing more flexible lighting conditions for the synthesized satellite image.
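The Light Manipulation module itself is part of the diffusion pipeline; as a loose stand-in for the underlying idea of shifting global lighting while preserving content, here is a simple histogram-matching sketch (explicitly not the paper's method):

```python
import numpy as np
from skimage.exposure import match_histograms

def align_lighting(synth: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Match the synthesized aerial image's per-channel intensity
    distribution to a reference image: layout and edges are untouched,
    only the global tone shifts toward the reference lighting."""
    return match_histograms(synth, reference, channel_axis=-1)
```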

Conclusion

In this study, we introduce SkyDiffusion, a novel approach designed for street-to-satellite cross-view image synthesis. SkyDiffusion operates solely on street images as input, using the BEV paradigm and diffusion models to generate satellite images, and achieves state-of-the-art performance in both content consistency and image realism across multiple cross-view datasets. Additionally, we introduce Ground2Aerial-3, a cross-view synthesis dataset featuring aerial image synthesis tasks for multiple new scenarios, offering practical value and inspiration for future cross-view synthesis research.