LEGION: Learning to Ground and Explain for Synthetic Image Detection

1Shanghai Jiao Tong University 2Shanghai Artificial Intelligence Laboratory
3Beihang University 4Sun Yat-Sen University 5SenseTime Research

*Equal Contribution


Abstract

The rapid advancement of generative technology has emerged as a double-edged sword. While it offers powerful tools that enhance convenience, it also raises significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and focus overly on image manipulation detection, while current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building on this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, notably surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.

SynthScars Overview

We propose SynthScars, a diverse and challenging dataset tailored for fully synthetic image forgery analysis. It features 12,236 AI-generated images produced by multiple generators, covering 4 distinct image content types (Human, Object, Scene, Animal), 3 categories of artifacts (Physics, Distortion, Structure), and fine-grained annotations including pixel-level masks, detailed textual explanations, and artifact category labels.
Compared to previous datasets, SynthScars addresses key drawbacks and offers:
  • High-Quality Synthetic Images. We meticulously curate challenging synthetic data from publicly available datasets such as Chameleon and RichHF-18K, ensuring a diverse selection of AI-generated images. These images are produced by various cutting-edge generators, including Midjourney, DALLE-3, and LoRA fine-tuned Stable Diffusion.
  • Style and Domain Consistency. To focus on detecting photorealistic synthetic images, which pose significant security risks and are challenging for humans to distinguish, we carefully filter out overly artistic and stylized images (e.g., cartoon or watercolor style).
  • Dual Fine-Grained Annotations. Our dataset includes expert-annotated, multi-dimensional labels for 12,236 samples, covering pixel-level segmentation, detailed textual explanations, and artifact category labels, providing comprehensive information for synthetic image analysis (an illustrative record is sketched after this list).
  • Challenging Artifact Types. Unlike traditional forgery datasets, which often focus on object-level tampered artifacts with strong contour dependencies, SynthScars introduces a more challenging set of artifacts that require global reasoning, such as lighting and shadow inconsistencies that violate physical laws, extending the task's complexity beyond locally tampered regions.
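To make the annotation format concrete, below is a minimal sketch of what a single SynthScars-style record could look like. The field names and values are illustrative assumptions for exposition, not the released schema.

import json

# Hypothetical SynthScars-style record; field names are illustrative
# assumptions, not the released schema.
record = {
    "image": "images/human_00042.png",
    "content_type": "Human",          # Human | Object | Scene | Animal
    "regions": [
        {
            "category": "Physics",    # Physics | Distortion | Structure
            "mask_rle": "...",        # pixel-level segmentation, e.g. COCO RLE
            "explanation": (
                "The shadow under the left arm points toward the light "
                "source, violating the scene's lighting geometry."
            ),
        }
    ],
}
print(json.dumps(record, indent=2))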

Architecture & Pipeline

We introduce LEGION, a multi-task framework for image forgery analysis that serves as both a defender against forgery techniques and a controller for generative models (a minimal interface sketch follows the list):
  • As a defender, LEGION performs image authenticity detection, artifact localization, and anomaly-aware textual explanation generation. It exhibits strong generalization, robustness, and interpretability, enabling a deep-level forensic analysis of synthetic images.
  • As a controller, LEGION acts as an essential component in our image regeneration/inpainting pipelines, leveraging its precise artifact segmentation masks and powerful textual explanations to guide refined image descriptions and local artifact correction, ultimately enhancing the realism and perceptual quality of the generated images.
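The sketch below illustrates how the two roles could compose in code. LegionModel, analyze, and the returned fields are hypothetical names chosen for exposition, not the released API.

from PIL import Image

class LegionModel:
    """Hypothetical wrapper; names are illustrative, not the released API."""

    def analyze(self, image: Image.Image) -> dict:
        """Defender: return an authenticity label plus, for each artifact
        region, a segmentation mask and a textual explanation."""
        raise NotImplementedError  # stands in for the MLLM forward pass

def refine(image: Image.Image, model: LegionModel, editor) -> Image.Image:
    """Controller: feed LEGION's masks and explanations to a generative editor."""
    report = model.analyze(image)
    if report["label"] == "real":
        return image
    for region in report["regions"]:
        # Each mask localizes the fix; each explanation conditions it.
        image = editor(image, mask=region["mask"], guidance=region["explanation"])
    return image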

Quantitative Results for Localization

We compare LEGION with SOTA models for fully synthetic artifact localization, including traditional experts (HiFi-Net, TruFor, PAL4VST), object-grounding VLMs (Ferret, Griffon, LISA), and general-purpose MLLMs (InternVL2, Qwen2-VL, DeepSeek-VL2). We conduct in-domain assessment on SynthScars and further evaluate generalization to unseen domains using LOKI and RichHF-18K.
Our key findings are as follows:
  • Compared with traditional experts, LEGION outperforms the strongest expert model, PAL4VST, by 10.65% in F1 score for the Object category on SynthScars, and consistently surpasses it on the other two datasets (see the metric sketch after this list).
  • For object-grounding VLMs and general-purpose MLLMs, we observe two extreme failure modes: some models fail to identify foreground regions altogether (e.g., DeepSeek-VL2), while others overestimate artifacts and treat most of the image as one large artifact region (e.g., Ferret, Griffon, and Qwen2-VL), yielding low mIoU but artificially high F1 scores. InternVL2 and LISA exhibit these extremes less severely, yet LEGION outperforms both across the majority of metrics.
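For reference, the sketch below computes pixel-level IoU and F1 between a predicted and a ground-truth binary mask. This is the standard formulation; the paper's exact aggregation across images may differ.

import numpy as np

def mask_scores(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Pixel-level IoU and F1 between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    precision = inter / (pred.sum() + eps)
    recall = inter / (gt.sum() + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return inter / (union + eps), f1

pred = np.zeros((256, 256), dtype=bool); pred[64:192, 64:192] = True
gt = np.zeros((256, 256), dtype=bool);   gt[96:224, 96:224] = True
print(mask_scores(pred, gt))  # ~(0.391, 0.562): partially overlapping masks

Note that an over-segmenting prediction (mask covering nearly the whole image) keeps recall near 1, which inflates F1 while IoU collapses; this is exactly the failure mode observed above for Ferret, Griffon, and Qwen2-VL.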

Qualitative Cases for Localization

Quantitative Results for Explanation

We conduct a comparative analysis of recently released open-source models (e.g., DeepSeek-VL2) and closed-source models (e.g., the December 2024 release of GPT-4o) across a range of parameter scales. In our evaluation, we test on SynthScars and LOKI, both of which contain detailed artifact explanations, measuring ROUGE-L for surface-level structural alignment and CSS for semantic equivalence, so as to jointly assess lexical coherence and contextual fidelity.
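As a rough illustration of the two metrics, the snippet below computes ROUGE-L with the rouge-score package and a cosine similarity between sentence embeddings as a stand-in for CSS. The embedding encoder shown is an assumption; the paper's CSS implementation may use a different model.

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

ref = "The shadow direction contradicts the light source."
hyp = "The shadows point away from the apparent light source."

# ROUGE-L: longest-common-subsequence overlap for surface-level alignment.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(rouge.score(ref, hyp)["rougeL"].fmeasure)

# Cosine similarity of sentence embeddings for semantic equivalence
# (encoder choice is an assumption, not the paper's exact setup).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode([ref, hyp], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())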
Our key findings are as follows:
  • LEGION achieves superior performance across both datasets, surpassing other MLLMs while using only 8B parameters.
  • LLaVA-v1.6 attains the second-best results.
  • DeepSeek-VL2 and GPT-4o exhibit degraded performance due to their excessively verbose and redundant outputs.

Qualitative Cases for Explanation

Detection Performance

We train LEGION on ProGAN and evaluate its cross-generator generalization on the UniversalFakeDetect benchmark. LEGION achieves the highest accuracy on GANs, CRN, and IMLE, secures a competitive second place on SITD, and maintains comparable detection performance on the remaining generators.

More Cases of LEGION

The first row depicts the ground truth, while the second row shows the corresponding predictions generated by LEGION.

Quantitative Results for Refinement

As a controller, we leverage LEGION's image forgery analysis capabilities to construct two pipelines, regeneration and inpainting, that enhance image generation. In this section, we randomly sample 200 images from the SynthScars test set and assess the quality and realism of the regenerated and inpainted images using the Human Preference Score (HPS). The results indicate that, after multiple iterations of optimization, the regeneration and inpainting pipelines improve the average HPS of images by 6.98% and 2.14%, respectively.
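A minimal sketch of this evaluation step, assuming the hpsv2 package from the HPSv2 repository; the HPS variant used in the paper is not restated here, so treat the version string as an assumption.

import hpsv2

def mean_hps(image_paths, prompts):
    """Average Human Preference Score over (image, prompt) pairs."""
    scores = [hpsv2.score(path, prompt, hps_version="v2.1")[0]
              for path, prompt in zip(image_paths, prompts)]
    return sum(scores) / len(scores)

# before = mean_hps(original_paths, prompts)   # e.g. the 200 sampled images
# after  = mean_hps(refined_paths, prompts)    # after pipeline iterations
# gain   = (after - before) / before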

Qualitative Cases for Regeneration

We employ an iterative generation approach that combines prompt revision with a text-to-image (T2I) model. Specifically, the pipeline iteratively refines the description of artifact regions using artifact explanations from LEGION, eliminating ambiguity and inconsistencies.
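A condensed sketch of this loop, assuming a hypothetical legion_analyze wrapper (returning artifact explanations) and a hypothetical revise_prompt helper (an LLM call that folds explanations into the prompt); the T2I step is standard diffusers usage with an arbitrary checkpoint.

import torch
from diffusers import StableDiffusionPipeline

def legion_analyze(image) -> dict:
    """Hypothetical wrapper around LEGION: {'regions': [{'explanation': str}]}."""
    raise NotImplementedError

def revise_prompt(prompt: str, explanations: list) -> str:
    """Hypothetical LLM call that folds artifact explanations into the prompt."""
    raise NotImplementedError

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a woman waving at the camera"
image = pipe(prompt).images[0]
for _ in range(3):                       # refinement rounds
    report = legion_analyze(image)
    if not report["regions"]:
        break                            # no remaining artifacts
    # e.g. add constraints like "natural lighting" or hand-specific details
    prompt = revise_prompt(prompt, [r["explanation"] for r in report["regions"]])
    image = pipe(prompt).images[0]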
Here are two distinct examples, analyzed as follows:
  • Case 1: LEGION detects a cartoonish style in the original image and refines the prompt with constraints like "natural lighting" and "realistic style". After one refinement round, the image becomes significantly more realistic.
  • Case 2: The woman's left pinky finger is deformed in the initial image. Guided by LEGION, subsequent prompts add hand-specific details, enabling the model to refine the region. Two rounds of optimization are needed to correct the structure and achieve a natural form.

Qualitative Cases for Inpainting

We also construct a pipeline to facilitate the inpainting process, using an inpainting model to iteratively remove artifacts and progressively enhance image quality. Compared to regeneration, this approach better preserves non-artifact regions, since the image is not regenerated in full but selectively refined only in anomalous areas.
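A sketch of this loop under the same assumptions as above (a hypothetical legion_analyze wrapper returning masks and explanations); the inpainter is standard diffusers usage, and feeding the explanation in as the prompt is a simplification of the paper's guidance.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("synthetic_input.png")  # hypothetical image under analysis
for _ in range(3):                         # iterative artifact removal
    report = legion_analyze(image)         # hypothetical wrapper, as above
    if not report["regions"]:
        break
    region = report["regions"][0]
    image = inpaint(
        prompt=region["explanation"],      # the explanation guides the fix
        image=image,
        mask_image=region["mask"],         # only the anomalous area is redrawn
    ).images[0]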
Here is a challenging example. In the original image, the left reflection on the water mismatches the wall's color, while the right reflection contains an unrealistic window shape, violating physical laws. Through multiple iterations, LEGION progressively identifies the entire reflection region, highlighting color and shape discrepancies to guide the inpainting process. By the third iteration, the artifact region is successfully refined, achieving high-quality restoration.

Citation

@misc{kang2025legionlearninggroundexplain,
  title={LEGION: Learning to Ground and Explain for Synthetic Image Detection},
  author={Hengrui Kang and Siwei Wen and Zichen Wen and Junyan Ye and Weijia Li and Peilin Feng and Baichuan Zhou and Bin Wang and Dahua Lin and Linfeng Zhang and Conghui He},
  year={2025},
  eprint={2503.15264},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.15264}
}