
GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

1ShanghaiTech University    2ByteDance    3Shanghai Engineering Research Center of Intelligent Vision and Imaging
NeurIPS 2025

*Indicates Equal Contribution

Indicates Corresponding Author

Abstract

While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). The framework employs specialized rewards, including a history-aware objective that directly links summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks.

Motivation

Contemporary GUI agents face two primary challenges when handling complex and diverse navigation tasks: poor generalization to unseen tasks, and weak long-horizon reasoning. For tasks requiring a continuous sequence of operations, existing methods for modeling historical information are inadequate, impairing the agent's ability to maintain decision-making coherence across extended interactions. To overcome these limitations, we argue that an agent, much like a human, must be able both to efficiently summarize past experiences and to perform clear logical reasoning based on that summary. We therefore propose a reasoning-enhanced framework for GUI navigation. The core innovation of this framework is its departure from a one-time review of history; instead, it establishes a dynamic, cyclical reasoning mechanism. At each decision step, the agent compresses the entire interaction trajectory up to that point into a concise history summary. Critically, this newly generated summary then serves as the core context fed directly into the structured reasoning module at the subsequent step.

Overview

GUI-Rise agent framework overview. It introduces a three-subtask framework that integrates structured reasoning, action prediction, and history summarization. At each step, the agent performs structured reasoning (progress estimation and decision analysis), predicts the next GUI action, and updates a compact history summary for the next iteration.

We propose GUI-Rise, a reasoning-augmented agent for GUI navigation. At the core of GUI-Rise is a three-subtask framework that emulates the human "think-act-summarize" decision-making process, ensuring that the agent grounds each step's decision in sufficient historical information. During each interaction with the GUI environment, GUI-Rise sequentially executes the following three tightly coupled subtasks:

  1. Structured Reasoning Subtask: The agent first evaluates the current task progress to determine the next course of action.
  2. Action Prediction Subtask: Based on the outcome of the reasoning stage, it then outputs an action in the specified format.
  3. History Summarization Subtask: Finally, it compresses the executed operation and the corresponding GUI state into a compact and coherent representation for future reference.
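The think-act-summarize cycle above can be sketched as a simple episode loop. This is an illustrative sketch, not the released implementation: `agent.step` (wrapping the MLLM call) and the `env` interface are hypothetical names, and the action format shown in the comment is only an example.

```python
from dataclasses import dataclass

@dataclass
class StepOutput:
    reasoning: str   # structured reasoning: progress estimation + decision analysis
    action: str      # predicted GUI action, e.g. "CLICK(x=0.42, y=0.17)"
    summary: str     # compact history summary carried to the next step

def run_episode(agent, env, instruction, max_steps=15):
    """Run one GUI navigation episode with the think-act-summarize cycle.

    `agent` and `env` are hypothetical interfaces: `agent.step` performs the
    three subtasks in one model call; `env.observe`/`env.execute` wrap the GUI.
    """
    summary = ""  # empty history summary at the first step
    for _ in range(max_steps):
        screenshot = env.observe()
        # one call produces reasoning, the next action, and the updated summary
        out: StepOutput = agent.step(instruction, screenshot, summary)
        done = env.execute(out.action)
        # the new summary is the only history carried forward to the next step
        summary = out.summary
        if done:
            break
    return summary
```

Note that the full interaction trajectory is never re-read: each step sees only the current screenshot, the instruction, and the previous step's summary.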

Training

Overview of the GUI-Rise training pipeline. The training consists of two stages: (1) supervised learning with pseudo-labeled summaries and ground truth action trajectories to initialize reasoning, and (2) reinforcement learning with rule-based and model-based rewards to improve decision-making and generalization.

To enable this structured reasoning capability, we develop a two-stage training paradigm. The first stage bootstraps the model on a small, synthetically labeled dataset to establish basic reasoning and summarization skills, while the second stage adopts reinforcement learning in a simulated GUI environment to refine task-specific reasoning strategies through interaction.

We employ Group Relative Policy Optimization (GRPO) with three complementary reward functions during the reinforcement learning stage:

  1. Format Reward: Enforcing structured reasoning through semantic tagging of components.
  2. Action Accuracy Reward: Evaluating the accuracy and consistency of predicted actions.
  3. History Summarization Reward: Assessing the history summary's quality for future decisions.
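A minimal sketch of how these pieces could fit together: the three rewards are combined into a scalar, and GRPO turns the rewards of a group of sampled responses into group-relative advantages by normalizing with the group's mean and standard deviation. The reward weights and function names here are illustrative assumptions, not the paper's exact formulation.

```python
import statistics

def combined_reward(fmt_ok: bool, action_acc: float, summary_score: float,
                    w_fmt: float = 0.2, w_act: float = 0.6, w_sum: float = 0.2) -> float:
    """Weighted sum of format, action accuracy, and history summarization
    rewards; the weights are illustrative, not the trained configuration."""
    return w_fmt * float(fmt_ok) + w_act * action_acc + w_sum * summary_score

def grpo_advantages(rewards, eps: float = 1e-6):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the mean and std of its sampling group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are computed relative to the group, responses that merely match the group average receive no learning signal, which is what lets a reward on summary quality shape the policy without a learned value model.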

Experiments

To comprehensively evaluate the performance of GUI-Rise, we design evaluation criteria from two perspectives: (1) Out-of-domain evaluation. We use the GUIAct benchmark as the training set and evaluate on offline benchmarks such as AITW and Mind2Web, as well as online benchmarks including MiniWoB, OSWorld, and AndroidWorld. (2) In-domain evaluation. We use the AITW, Mind2Web, and MiniWoB benchmarks as training data and evaluate on their respective test sets.

Ablation study of GUI-Rise on the AITW benchmark. "TST" denotes the use of two-stage training with RL; "SCoT" indicates the incorporation of structured Chain-of-Thought reasoning; "HS" means using the history summary as input; "HSR" refers to applying the history summary reward during RL.


Ablation Studies. We report the results of ablation studies on GUI-Rise-2B using the AITW benchmark, aiming to assess the effects of components such as two-stage training, structured reasoning, and history summarization.

Visualization