Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents

¹Shanghai Artificial Intelligence Laboratory  ²Sun Yat-Sen University

*Equal Contribution

Correspondence

Abstract

Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent multimodal large language models (MLLMs) have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception and shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images spanning the spectrum, product, and RGB modalities, equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments across different LLM backbones, together with comparisons against general agent frameworks and against MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation.

Earth-Agent

We introduce Earth-Agent, an EO agent framework cast as a ReAct-style Partially Observable Markov Decision Process (POMDP). The LLM serves as the policy, iterating a loop of tool calling, memory update, deliberation, and action to solve tasks conditioned on the goal and interaction history. In addition, Earth-Agent integrates 104 specialized tools across five functional kits, namely Index, Inversion, Perception, Analysis, and Statistics, spanning perceptual and spectral analysis. To evaluate both outcomes and reasoning, we adopt a dual-level protocol: end-to-end assessment of final Accuracy and trajectory Efficiency, together with step-by-step checks of Tool-Any-Order, Tool-In-Order, Tool-Exact-Match, and Parameter Accuracy to characterize the completeness and fidelity of reasoning trajectories.
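
The loop below is a minimal sketch of this ReAct-style policy. The helper names (call_llm, execute_tool) and the list-based memory are illustrative assumptions standing in for the LLM policy and the MCP tool layer, not the actual Earth-Agent implementation; the sketch only shows the deliberate-act-observe cycle described above.

# Minimal ReAct-style agent loop (illustrative sketch; call_llm and execute_tool
# are hypothetical callables for the LLM policy and the MCP tool layer).
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str                                    # user-specified task
    memory: list = field(default_factory=list)   # interaction history (actions + observations)

def run_agent(goal, call_llm, execute_tool, max_steps=20):
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        # Deliberation: the LLM policy conditions on the goal and the full history.
        decision = call_llm(goal=state.goal, history=state.memory)
        if decision["action"] == "final_answer":
            return decision["content"]
        # Action: invoke a tool from one of the five kits (Index, Inversion, Perception, ...).
        observation = execute_tool(decision["tool"], **decision["arguments"])
        # Memory update: store the (action, observation) pair for the next iteration.
        state.memory.append({"action": decision, "observation": observation})
    return "Max steps reached without a final answer."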

Earth-Bench

We propose Earth-Agent Benchmark (Earth-Bench), a dataset designed to evaluate tool-augmented EO agents in realistic Earth science analysis scenarios. The benchmark integrates three major types of Earth observation data: RGB Imagery (RGB), Raw Spectral Data (Spectrum), and Processed Earth Products (Products). It supports 14 representative tasks, including classification, detection, temperature monitoring, and weather forecasting, with a particular emphasis on scientific analysis that requires quantitative reasoning rather than qualitative description. Earth-Bench also covers two evaluation regimes: Auto-Planning corresponds to the step-implicit setting and evaluates the agent's ability to autonomously plan its solution trajectory, while Instruction-Following corresponds to the step-explicit setting and evaluates the agent's ability to follow and translate human instructions into executable actions.
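
For concreteness, a single benchmark item might be represented roughly as below; the field and tool names are our own illustrative guesses rather than the released Earth-Bench schema.

# Illustrative Earth-Bench task record (field and tool names are hypothetical,
# not the released schema).
example_task = {
    "task_id": "products_lst_0001",
    "modality": "Products",                # one of "RGB", "Spectrum", "Products"
    "task_type": "temperature monitoring",
    "images": ["lst_2020_01.tif", "lst_2020_02.tif"],   # may span hundreds of files
    "question": "How did the regional mean land-surface temperature change from 2020 to 2021?",
    "regime": "Auto-Planning",             # or "Instruction-Following" (step-explicit)
    "reference_trajectory": [              # expert-curated tool calls for step-level checks
        {"tool": "load_product", "args": {"variable": "LST"}},
        {"tool": "regional_mean", "args": {"region": "study_area"}},
        {"tool": "trend_analysis", "args": {"years": [2020, 2021]}},
    ],
}
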
Compared to previous datasets, Earth-Bench addresses key drawbacks and offers:
  • Beyond MLLM Benchmarks: Existing MLLM benchmarks (e.g., RSVQA-HR, EarthVQA, VRSBench, Geo-Bench) are limited to single-step RGB perception without tool use or quantitative reasoning. In contrast, Earth-Bench introduces cross-modality integration, tool-augmented analysis, and multi-step reasoning to address these limitations.
  • Advancing Agent Benchmarks: Unlike PEACE, ThinkGeo, and UnivEarth, Earth-Bench scales up to 13K+ samples, 104 tools, and an average of more than five reasoning steps per task, emphasizing quantitative analysis and trajectory diagnostics.
Overall, Earth-Bench systematically evaluates EO agents on quantitative scientific reasoning and the reproducibility of their reasoning trajectories, providing a valuable benchmark for advancing tool-augmented EO intelligence.
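
A rough sketch of how the step-level trajectory metrics named above (Tool-Any-Order, Tool-In-Order, Tool-Exact-Match, Parameter Accuracy) could be computed from a predicted and a reference tool sequence is given below; this reflects our reading of the metric names, and the official scoring code may differ in details.

# Sketch of the step-level trajectory checks (our interpretation of the metric names;
# the official Earth-Bench scoring code may differ).
def tool_any_order(pred, ref):
    """Fraction of reference tools that appear anywhere in the predicted trajectory."""
    return sum(t in pred for t in ref) / len(ref)

def tool_in_order(pred, ref):
    """Fraction of reference tools matched in order (greedy subsequence match)."""
    it = iter(pred)
    return sum(any(t == p for p in it) for t in ref) / len(ref)

def tool_exact_match(pred, ref):
    """1.0 only if the predicted tool sequence reproduces the reference exactly."""
    return float(pred == ref)

def parameter_accuracy(pred_calls, ref_calls):
    """Fraction of reference calls whose tool name and arguments are both reproduced."""
    hits = sum(any(p["tool"] == r["tool"] and p["args"] == r["args"] for p in pred_calls)
               for r in ref_calls)
    return hits / len(ref_calls)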

Earth-Agent with Different LLM Backbones

We evaluate 3 closed-source and 10 leading open-source LLMs. For closed-source models, we consider GPT-5, GPT-4o, and Gemini-2.5. The open-source models include DeepSeek-V3.1, Kimi-K2, Qwen3-Max-Preview, Qwen3-32B, and InternVL3.5, which are among the strongest open models available to date.
Our key findings are as follows:
  • Closed-source LLMs achieve higher final accuracy, while open-source models excel in tool-use accuracy and reasoning alignment.
  • Instruction-following improves tool calling but does not always raise final accuracy, and in some advanced models it can even reduce it.
  • Models reliably identify the correct tools, but irrelevant reasoning steps hinder exact trajectory matching and parameter execution, which are key bottlenecks for accurate EO data processing (see the toy example below).
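
As a toy illustration of this bottleneck, consider a hypothetical index tool computing NDVI = (NIR - Red) / (NIR + Red): a call with the band arguments swapped still executes, but silently flips the sign of the result and corrupts any downstream quantitative analysis. The function and values below are purely illustrative.

# Toy example: a parameter mistake (swapped band arguments) still runs but gives a
# sign-flipped, misleading NDVI. compute_ndvi is a hypothetical index tool.
import numpy as np

def compute_ndvi(nir, red):
    return (nir - red) / (nir + red + 1e-9)   # epsilon guards against division by zero

nir = np.array([0.45, 0.50])   # typical near-infrared reflectance over vegetation
red = np.array([0.05, 0.08])   # typical red reflectance over vegetation

print(compute_ndvi(nir, red))  # ~[0.80, 0.72]: healthy vegetation
print(compute_ndvi(red, nir))  # ~[-0.80, -0.72]: swapped arguments, nonsensical output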

Comparison with general agents

Since many Earth-Bench tasks involve processing hundreds of images, existing open-source agent frameworks cannot handle these questions due to input size constraints. To enable fair comparison, we construct Earth-Bench-Lite, a reduced yet representative subset that preserves modality diversity while remaining within the capacity of general-purpose agents. It consists of 60 questions evenly distributed across the three EO modalities: Spectrum, Products, and RGB.
By comparison, general agents show limited modality coverage. They can handle relatively simple Spectrum tasks by writing ad-hoc code, but perform poorly on Products tasks due to the lack of domain-specific spatiotemporal analysis tools. For the RGB modality, MGX and Coze even fail to complete any tasks. In contrast, by interacting with 104 predefined geoscience tools, Earth-Agent consistently achieves superior performance across all three modalities, whether driven by the closed-source GPT-5 or the open-source DeepSeek-V3.1.

Comparison with MLLM-based EO methods

We further compare Earth-Agent with large remote sensing models on classification, detection, and segmentation tasks.
Earth-Agent outperforms existing MLLMs on classification, detection, and segmentation benchmarks. Prior MLLM-based systems show limited generalization: for instance, LHRS-Bot performs well on classification but fails on detection and grounding, while VHM achieves high classification accuracy yet cannot handle detection or segmentation. In contrast, Earth-Agent leverages a predefined toolkit of 104 geoscience functions and expert models, enabling adaptive tool use and robust performance across modalities. This modular design overcomes the limited extensibility of previous approaches.
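
As a sketch of how such a modular tool ecosystem can be assembled, the snippet below exposes a single geoscience function as an MCP tool using the FastMCP helper from the MCP Python SDK; the tool name and its body are illustrative and are not Earth-Agent's actual code. Because each capability is just another registered tool, new expert models can be added without retraining the underlying LLM.

# Sketch: exposing one geoscience function as an MCP tool (illustrative only;
# not Earth-Agent's actual toolkit). Assumes the FastMCP helper from the MCP
# Python SDK (pip install mcp).
import numpy as np
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("earth-tools-demo")

@mcp.tool()
def ndwi(green: list[float], nir: list[float]) -> list[float]:
    """Normalized Difference Water Index from green and near-infrared reflectance."""
    g, n = np.asarray(green), np.asarray(nir)
    return ((g - n) / (g + n + 1e-9)).tolist()

if __name__ == "__main__":
    mcp.run(transport="stdio")   # an agent connects as an MCP client and can call ndwi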


BibTeX

@misc{feng2025earthagentunlockinglandscapeearth,
  title={Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents},
  author={Peilin Feng and Zhutao Lv and Junyan Ye and Xiaolei Wang and Xinjie Huo and Jinhua Yu and Wanghan Xu and Wenlong Zhang and Lei Bai and Conghui He and Weijia Li},
  year={2025},
  eprint={2509.23141},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.23141},
}