Task Description
Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street-view
images while maintaining a consistent content layout, simulating a top-down view.
Challenge
The significant viewpoint difference leads to a large domain gap between the two views, and dense urban scenes limit
the visible range of street views, making this cross-view generation task particularly challenging.
Method
We introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from
street view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm.
New Dataset
We introduce Ground2Aerial-3, a novel dataset designed for diverse ground-to-aerial image synthesis applications,
including disaster-scene aerial image synthesis, historical high-resolution satellite image synthesis, and
low-altitude UAV image synthesis.
The video shows a pull-up from the ground street view to an aerial, top-down perspective. The video was rendered with
Google Engine and MatrixCity.
Ground-to-Aerial Image Synthesis Results.
SkyDiffusion outperforms state-of-the-art methods on cross-view datasets
spanning natural (CVUSA), suburban (CVACT), and urban (VIGOR-Chicago)
scenes, as well as diverse application scenarios (G2A-3), generating
realistic, content-consistent aerial images.
Ground2Aerial-3 Dataset
We propose Ground2Aerial-3 (G2A-3), a multi-task cross-view synthesis
dataset designed to explore the performance of cross-view synthesis
methods in several novel scenarios. As shown in Figure 3, G2A-3 contains
nearly 10k pairs of street-view and aerial images, covering disaster-scene
aerial image synthesis, historical high-resolution satellite image
synthesis, and low-altitude UAV image synthesis. The data for each task is
randomly split into training and test sets at a ratio of 5:1. The ground
street-view images are 1024×512 pixels, with true north aligned to the
image center, and the aerial images are 512×512 pixels, centered on the
location of the corresponding ground image.
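For concreteness, here is a minimal sketch of how the 5:1 split could be reproduced. The directory layout and file naming are hypothetical; only the ratio comes from the dataset description.

import random
from pathlib import Path

# Hypothetical layout: each task folder holds paired files sharing a stem,
# e.g. g2a3/disaster/ground/0001.jpg and g2a3/disaster/aerial/0001.jpg.
def split_pairs(task_dir, seed=0):
    stems = sorted(p.stem for p in (Path(task_dir) / "ground").glob("*.jpg"))
    random.Random(seed).shuffle(stems)
    n_test = len(stems) // 6  # 5:1 train/test ratio
    return stems[n_test:], stems[:n_test]

train_ids, test_ids = split_pairs("g2a3/disaster")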
For the low-altitude UAV image synthesis task, we used the virtual
MatrixCity environment and the UE engine to place six single-view cameras
at ground level, covering four horizontal views, one upward view, and one
downward view. The six single-view images are first rendered and then
stitched into a panorama using the py360convert library (a sketch of this
step follows below). At the same xy position but at an altitude of 20 m, a
simulated UAV with a downward-facing camera captures the corresponding
aerial image, forming a cross-view dataset.
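The stitching step can be reproduced with py360convert's cubemap-to-equirectangular conversion. Below is a minimal sketch, assuming the six faces are saved as square images under hypothetical file names; py360convert.c2e and its "dict" cube format (keys F, R, B, L, U, D) are the library's actual API.

import numpy as np
import imageio.v2 as imageio
import py360convert

# Load the six square cube faces rendered in MatrixCity/UE
# (file names are hypothetical).
faces = {k: imageio.imread(f"render_{k}.png") for k in "FRBLUD"}

# Stitch the cubemap into a 1024x512 equirectangular panorama,
# matching the dataset's ground-image resolution.
pano = py360convert.c2e(faces, h=512, w=1024, cube_format="dict")
imageio.imwrite("panorama.png", pano.astype(np.uint8))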
The images below show sample visualizations of data from different
tasks.
Method
Overview of the proposed SkyDiffusion framework, including the Curved-BEV transformation and the BEV-controlled
diffusion model. The lower parts present the results of the One-to-One and Multi-to-One BEV transformations,
respectively.
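To make the BEV idea concrete, the sketch below projects an equirectangular street panorama onto a flat ground plane via inverse perspective mapping, a simplified stand-in for the One-to-One case. It is not the paper's exact Curved-BEV transform, which additionally curves the mapping to capture distant content; the camera height and ground range are assumed values.

import numpy as np

def panorama_to_bev(pano, bev_size=512, ground_range=50.0, cam_height=2.0):
    # pano: equirectangular image (H, W, 3) with true north at the
    # horizontal center, as in the dataset.
    H, W = pano.shape[:2]
    # Ground-plane coordinates (meters) for every BEV pixel, camera at the
    # center, north-up.
    coords = np.linspace(-ground_range, ground_range, bev_size)
    x, y = np.meshgrid(coords, coords[::-1])
    r = np.sqrt(x**2 + y**2) + 1e-6
    azimuth = np.arctan2(x, y)             # 0 rad = north
    elevation = np.arctan2(cam_height, r)  # angle below the horizon
    # Map viewing angles to equirectangular pixel coordinates.
    col = ((azimuth / (2 * np.pi) + 0.5) % 1.0) * W
    row = (0.5 + elevation / np.pi) * H
    return pano[row.astype(int).clip(0, H - 1), col.astype(int).clip(0, W - 1)]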
Evaluation
Quantitative Evaluation on Existing Datasets
On the natural CVUSA and suburban CVACT datasets, our SkyDiffusion method
achieves outstanding results. Compared to state-of-the-art methods,
it reduces FID by 25.72% and increases SSIM by 7.68%,
demonstrating its superiority in synthesizing realistic and
content-consistent satellite images. On the urban VIGOR-Chicago dataset,
SkyDiffusion reduces FID by 14.9% and improves SSIM by 9.41% compared
to the state-of-the-art method.
The tasks in the G2A-3 dataset are challenging; nevertheless, our method
achieves significant performance improvements over ControlNet, a commonly
used image-conditioned synthesis method, reducing FID by an average of
25.81% and increasing SSIM by an average of 12.88%.
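For reference, here is a minimal sketch of how these metrics are commonly computed, using torchmetrics for FID and scikit-image for SSIM; these library choices are ours, and the paper's exact evaluation code may differ. The random tensors stand in for real data.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from skimage.metrics import structural_similarity

# Stand-ins for ground-truth and synthesized aerial images:
# uint8 tensors of shape (N, 3, H, W).
real_batch = torch.randint(0, 256, (8, 3, 512, 512), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (8, 3, 512, 512), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 features
fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print("FID:", fid.compute().item())

# SSIM on one image pair; in practice it is averaged over the test set.
gt = real_batch[0].permute(1, 2, 0).numpy()
pred = fake_batch[0].permute(1, 2, 0).numpy()
print("SSIM:", structural_similarity(gt, pred, channel_axis=-1, data_range=255))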
We also conducted ablation experiments on these datasets. Compared to
directly using street-view images as input, the Curved-BEV method
improves performance across multiple metrics by transforming
street-view images into the satellite view for domain alignment,
indicating that it helps synthesize more content-consistent satellite
images. Furthermore, the Multi-to-One mapping improves the metrics beyond
the One-to-One mapping, demonstrating its effectiveness in dense urban
scenes.
Conclusion
In this study, we introduce SkyDiffusion, a novel approach designed for
street-to-satellite cross-view image synthesis. SkyDiffusion operates
solely on street-view images as input, utilizing the BEV paradigm and
diffusion models to generate satellite images. It achieves
state-of-the-art performance in both content consistency and image
realism across multiple cross-view datasets, demonstrating its superior
capabilities. Additionally, we introduce Ground2Aerial-3, a cross-view
synthesis dataset featuring aerial image synthesis tasks in multiple
novel scenarios, providing practical value and inspiration for future
cross-view synthesis research.