¹Sun Yat-Sen University  ²Shanghai AI Laboratory  ³SenseTime Research
Task Description
Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street-view
images while maintaining a consistent content layout, simulating a top-down view.
Challenge
The significant viewpoint difference
leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this
cross-view generation task particularly challenging.
Method
We introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from
street-view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm.
New Dataset
We introduce Ground2Aerial-3, a new dataset designed for diverse ground-to-aerial image synthesis
applications, including disaster-scene aerial image synthesis, historical high-resolution satellite image synthesis, and
low-altitude UAV image synthesis.
The video shows the result of pulling up from the ground street view to an aerial, top-down perspective. The videos come from Google Engine rendering and MatrixCity rendering.
Ground-to-Aerial Image Synthesis results.
SkyDiffusion outperforms state-of-the-art methods on cross-view datasets
across natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and
various application scenarios (G2A-3), generating realistic, consistent
aerial images.
Ground2Aerial-3 Dataset
We propose Ground2Aerial-3, a multi-task cross-view synthesis dataset
designed to explore the performance of cross-view synthesis methods in
several novel scenarios. As shown in Figure 3, G2A-3 contains nearly
10k pairs of street-view and aerial images, covering disaster scene
aerial image synthesis, historical high-resolution satellite image
synthesis, and low-altitude UAV image synthesis. The data for each
task are randomly split into training and test sets at a ratio of
5:1. The ground street-view images are 1024×512, with true north
aligned at the image center, and the aerial images are 512×512, aligned with
the center of the ground images.
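To make the split concrete, here is a minimal sketch of a 5:1 train/test split over paired files; the directory layout and filenames are hypothetical, not the released dataset structure.

```python
import random
from pathlib import Path

# Hypothetical layout (for illustration only): each task folder holds paired
# street/aerial images that share an ID, e.g. street/0001.png <-> aerial/0001.png.
task_dir = Path("G2A-3/disaster")
ids = sorted(p.stem for p in (task_dir / "street").glob("*.png"))

random.seed(0)
random.shuffle(ids)

n_train = len(ids) * 5 // 6          # 5:1 train/test ratio
train_ids, test_ids = ids[:n_train], ids[n_train:]
print(f"train: {len(train_ids)}  test: {len(test_ids)}")
```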
For the low-altitude UAV image synthesis task, we use the virtual
MatrixCity environment and the UE engine to place six single-view
cameras at ground level with different viewing directions: four
horizontal views, one upward view, and one downward view. The six
single-view images are first rendered and then stitched into a
panorama using the py360convert library. At the same xy position but
at an altitude of 20 m, a UAV with a downward-facing camera is
simulated to capture the paired aerial image, forming a cross-view
dataset.
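As a concrete sketch of the stitching step, the snippet below converts the six ground-level renders into a 1024×512 equirectangular panorama with py360convert's cubemap-to-equirectangular routine. It assumes the six views are square 90°-FOV images forming a cube map; the filenames and face assignment are illustrative, not the exact MatrixCity pipeline.

```python
import numpy as np
import imageio.v2 as imageio
import py360convert

# Assumed filenames for the six ground-level renders (square, 90° FOV each):
# four horizontal views plus one upward and one downward view.
faces = {
    "F": imageio.imread("render_front.png"),
    "R": imageio.imread("render_right.png"),
    "B": imageio.imread("render_back.png"),
    "L": imageio.imread("render_left.png"),
    "U": imageio.imread("render_up.png"),
    "D": imageio.imread("render_down.png"),
}

# Cubemap -> equirectangular panorama at the dataset's street-view resolution.
pano = py360convert.c2e(faces, h=512, w=1024, cube_format="dict")
imageio.imwrite("street_pano.png", pano.astype(np.uint8))
```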
The images below show sample visualizations of data from different
tasks.
Method
Overview of the proposed SkyDiffusion framework, including the curved BEV transformation and the BEV-controlled diffusion
model with light manipulation. The lower panels present the results of the
One-to-One and Multi-to-One BEV transformations, respectively.
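For intuition about the BEV step, the sketch below implements a plain flat-ground inverse projection from an equirectangular street panorama to a top-down grid. This is not the paper's curved BEV transformation, only the standard mapping it builds on; the camera height and ground coverage are arbitrary illustrative values.

```python
import numpy as np

def pano_to_bev(pano, bev_size=256, ground_extent=20.0, cam_height=2.0):
    """Flat-ground inverse projection from an equirectangular street panorama
    (H x W x C) to a top-down BEV image. cam_height and ground_extent (metres)
    are illustrative defaults, not values from the paper."""
    H, W = pano.shape[:2]
    # Ground-plane coordinates of each BEV pixel, with the camera at the centre
    # and +y pointing toward the panorama centre (true north in this dataset).
    xs = np.linspace(-ground_extent, ground_extent, bev_size)
    ys = np.linspace(ground_extent, -ground_extent, bev_size)
    x, y = np.meshgrid(xs, ys)

    theta = np.arctan2(x, y)                     # azimuth, 0 at the panorama centre
    r = np.sqrt(x**2 + y**2) + 1e-6              # horizontal distance on the ground
    phi = np.arctan2(-cam_height, r)             # elevation angle (below the horizon)

    u = ((theta / (2 * np.pi)) + 0.5) * (W - 1)  # panorama column
    v = (0.5 - phi / np.pi) * (H - 1)            # panorama row
    ui = np.clip(np.round(u).astype(int), 0, W - 1)
    vi = np.clip(np.round(v).astype(int), 0, H - 1)
    return pano[vi, ui]                          # nearest-neighbour sampling
```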
Evaluation
Quantitative Evaluation of Existing datasets
On the CVUSA and CVACT datasets, SkyDiffusion achieves outstanding
results: compared to state-of-the-art methods, it reduces FID by
25.72% and increases SSIM by 7.68%, demonstrating its ability to
synthesize realistic and content-consistent satellite images. On the
urban VIGOR-Chicago dataset, SkyDiffusion reduces FID by 14.9% and
improves SSIM by 9.41% compared to the state-of-the-art method.
The tasks in the G2A-3 dataset are challenging; nevertheless, our
method achieves significant performance improvements over ControlNet,
a commonly used image-conditioned synthesis method, reducing FID by
an average of 19.60% and increasing SSIM by an average of 9.90%.
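For reference, the pixel-level metrics can be reproduced with a short scikit-image snippet such as the one below; FID is computed separately over the full image sets (e.g. with pytorch-fid or torchmetrics), and the file paths here are placeholders.

```python
import imageio.v2 as imageio
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder paths: one synthesized aerial image and its ground-truth target.
pred = imageio.imread("synth_aerial.png")
target = imageio.imread("gt_aerial.png")

ssim = structural_similarity(target, pred, channel_axis=-1, data_range=255)
psnr = peak_signal_noise_ratio(target, pred, data_range=255)
print(f"SSIM: {ssim:.4f}  PSNR: {psnr:.2f} dB")
```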
We also conducted ablation experiments on these datasets. Compared to
directly using street-view images as input, the Curved-BEV method
improves performance across multiple metrics by transforming
street-view images into the satellite view for domain alignment,
indicating that it helps synthesize more content-consistent satellite
images. Furthermore, the Multi-to-One mapping improves the metrics
further compared to the One-to-One mapping, demonstrating its
effectiveness in dense urban scenes.
We also apply an optional Light Manipulation module to align the
lighting conditions of the synthesized image with those of the
target-domain image. The results indicate that Light Manipulation
effectively improves the SSIM and PSNR metrics, preserving the
original image content while providing more flexible lighting
conditions for the synthesized satellite image.
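As a rough stand-in for this kind of lighting alignment (not the paper's Light Manipulation module), the sketch below matches the colour statistics of a synthesized image to a reference image via per-channel histogram matching; the file paths are placeholders.

```python
import imageio.v2 as imageio
from skimage.exposure import match_histograms

# Placeholder paths: a synthesized aerial image and a reference whose lighting
# conditions we want the output to follow.
synth = imageio.imread("synth_aerial.png")
reference = imageio.imread("lighting_reference.png")

# Per-channel histogram matching as a simple proxy for lighting alignment.
aligned = match_histograms(synth, reference, channel_axis=-1)
imageio.imwrite("synth_aligned.png", aligned.astype("uint8"))
```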
Conclusion
In this study, we introduce SkyDiffusion, a novel approach designed
for street-to-satellite cross-view image synthesis. SkyDiffusion
operates solely with street images as input, utilizing the BEV
paradigm and diffusion models to generate satellite images. It
achieves state-of-the-art performance in both content consistency and
image realism across multiple cross-view datasets, demonstrating its
superior capabilities. Additionally, we introduce Ground2Aerial-3, a
cross-view synthesis dataset featuring aerial image synthesis tasks
for multiple new scenarios, providing practical value and inspiration
for future cross-view synthesis research.