Abstract
Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent multimodal large language models (MLLMs) have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy: they are confined to RGB perception and shallow reasoning, and they lack systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within a Model Context Protocol (MCP)-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond what pretrained MLLMs can offer. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images spanning spectrum, products, and RGB modalities, equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments across different LLM backbones, together with comparisons against general agent frameworks and against MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and the potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation.
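To make the tool-ecosystem idea concrete, the sketch below shows the general pattern of a tool-augmented EO call: a spectral-index tool is registered in a plain-Python registry and then dispatched from an agent-issued call. This is a minimal illustration only, not Earth-Agent's actual MCP server code; the names `register_tool`, `compute_ndvi`, `dispatch`, and the call format are assumptions made for the example.

```python
# Minimal, self-contained sketch of a tool-augmented EO call.
# NOT Earth-Agent's actual MCP implementation; `register_tool`,
# `compute_ndvi`, and `dispatch` are hypothetical stand-ins.
import numpy as np

TOOL_REGISTRY = {}

def register_tool(name):
    """Register a callable under a tool name (stand-in for MCP tool registration)."""
    def decorator(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator

@register_tool("compute_ndvi")
def compute_ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Standard NDVI = (NIR - Red) / (NIR + Red), a typical geophysical retrieval step."""
    return (nir - red) / np.clip(nir + red, 1e-6, None)

def dispatch(tool_call: dict):
    """Execute a tool call of the form {"tool": name, "args": {...}},
    mimicking how an agent-selected tool would be invoked."""
    fn = TOOL_REGISTRY[tool_call["tool"]]
    return fn(**tool_call["args"])

if __name__ == "__main__":
    # Synthetic 2x2 red / near-infrared bands standing in for real spectral data.
    red = np.array([[0.10, 0.20], [0.15, 0.25]])
    nir = np.array([[0.40, 0.50], [0.45, 0.55]])
    ndvi = dispatch({"tool": "compute_ndvi", "args": {"red": red, "nir": nir}})
    print(ndvi)  # values in [-1, 1]; higher means denser vegetation
```

In the full framework this dispatch step would sit inside a multi-step reasoning loop, with the LLM deciding which registered tool to call next and feeding the numeric result back into its subsequent reasoning.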
Earth-Agent
Earth-Bench
- Beyond MLLM Benchmarks: Existing MLLM benchmarks (e.g., RSVQA-HR, EarthVQA, VRSBench, Geo-Bench) are limited to single-step RGB perception without tool use or quantitative reasoning. In contrast, Earth-Bench introduces cross-modality integration, tool-augmented analysis, and multi-step reasoning to address these limitations.
- Advancing Agent Benchmarks: Unlike PEACE, ThinkGeo, and UnivEarth, Earth-Bench scales up with 13K+ samples, 104 tools, and an average of >5 reasoning steps, emphasizing quantitative analysis and trajectory diagnostics.
- Overall, Earth-Bench systematically evaluates EO agents on quantitative scientific reasoning and the reproducibility of their reasoning trajectories, providing a benchmark for advancing tool-augmented EO intelligence (a minimal sketch of such dual-level scoring is given below).
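As a hedged illustration of what dual-level evaluation can look like in practice, the sketch below scores an agent run at two levels: a trajectory-level comparison of the predicted tool sequence against an expert reference, and an outcome-level check on the final numeric answer. The specific metric choices here (step-level F1, a 5% relative-error tolerance) and the tool names in the example are assumptions for illustration, not Earth-Bench's exact protocol.

```python
# Illustrative dual-level scoring: trajectory match + final-outcome check.
# Metric definitions (step F1, 5% relative-error tolerance) are assumptions
# chosen for this sketch, not Earth-Bench's exact protocol.
from collections import Counter

def trajectory_f1(pred_tools: list[str], ref_tools: list[str]) -> float:
    """Step-level F1 between predicted and reference tool sequences (order-insensitive)."""
    if not pred_tools or not ref_tools:
        return 0.0
    overlap = sum((Counter(pred_tools) & Counter(ref_tools)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tools)
    recall = overlap / len(ref_tools)
    return 2 * precision * recall / (precision + recall)

def outcome_correct(pred: float, ref: float, rel_tol: float = 0.05) -> bool:
    """Final numeric answer counts as correct within a relative-error tolerance."""
    return abs(pred - ref) <= rel_tol * max(abs(ref), 1e-9)

if __name__ == "__main__":
    ref_trajectory = ["load_spectrum", "compute_ndvi", "zonal_statistics"]
    pred_trajectory = ["load_spectrum", "compute_ndvi", "visualize", "zonal_statistics"]
    print(trajectory_f1(pred_trajectory, ref_trajectory))  # 6/7 ≈ 0.857
    print(outcome_correct(pred=0.41, ref=0.40))            # True (2.5% relative error)
```

Scoring both levels separately is what lets the benchmark distinguish an agent that reaches the right number by the right procedure from one that guesses correctly with an unfaithful or irreproducible tool trajectory.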
Earth-Agent with different LLM backbones
- Closed-source LLMs achieve higher final accuracy, while open-source models excel in tool-use accuracy and reasoning alignment.
- Instruction-following improves tool calling but does not always raise final accuracy, and in some advanced models it can even reduce it.
- Models reliably identify the correct tools, but irrelevant reasoning steps hinder exact trajectory matching and parameter execution, which remain the key bottlenecks for accurate EO data processing.
Comparison with general agents
Comparison with MLLM-based EO methods
Earth-Agent with Different LLM Backbones
Example of Climate Analysis with Spectrum Data under the Auto-Planning Regime.
Example of Climate Analysis with Spectrum Data under the Instruction-Following Regime.
Example of Disaster Judgement with Spectrum Data under the Auto-Planning Regime.
Example of Disaster Judgement with Spectrum Data under the Instruction-Following Regime.
Example of Temperature Monitoring with Spectrum Data under the Auto-Planning Regime.
Example of Temperature Monitoring with Spectrum Data under the Instruction-Following Regime.
Example of Urban Management with Spectrum Data under the Auto-Planning Regime.
Example of Urban Management with Spectrum Data under the Instruction-Following Regime.
Example of Vegetation Monitoring with Spectrum Data under the Auto-Planning Regime.
Example of Vegetation Monitoring with Spectrum Data under the Instruction-Following Regime.
Example of Pollution Regulation with Products Data under the Auto-Planning Regime.
Example of Pollution Regulation with Products Data under the Instruction-Following Regime.
Example of Water Management with Products Data under the Auto-Planning Regime.
Example of Water Management with Products Data under the Instruction-Following Regime.
Example of Weather Management with Products Data under the Auto-Planning Regime.
Example of Weather Management with Products Data under the Instruction-Following Regime.
Example of Change Detection with RGB Data under the Auto-Planning Regime.
Example of Change Detection with RGB Data under the Instruction-Following Regime.
Example of Classification with RGB Data under the Auto-Planning Regime.
Example of Classification with RGB Data under the Instruction-Following Regime.
Case Study: Comparison with Other Agents
A Question Case of the Urban Management Task using Products Data with Responses from Different Agents.
A Question Case of the Urban Management Task using Products Data with Responses from Different Agents.
A Question Case of the Change Detection Task using RGB Data with Responses from Different Agents.
A Question Case of the Classification Task using RGB Data with Responses from Different Agents.
A Question Case of the Classification Task using RGB Data with Responses from Different Agents.
A Question Case of the Detection Task using RGB Data with Responses from Different Agents.
A Question Case of the Visual Grounding Task using RGB Data with Responses from Different Agents.
BibTeX
@misc{feng2025earthagentunlockinglandscapeearth,
title={Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents},
author={Peilin Feng and Zhutao Lv and Junyan Ye and Xiaolei Wang and Xinjie Huo and Jinhua Yu and Wanghan Xu and Wenlong Zhang and Lei Bai and Conghui He and Weijia Li},
year={2025},
eprint={2509.23141},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.23141},
}