Abstract
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses that combine progress estimation and decision reasoning, informing both the immediate action prediction and a compact history summary for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). The framework employs specialized rewards, including a history-aware objective that directly links summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training-data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate the framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks.
Motivation
Contemporary GUI agents face two primary challenges when handling complex and diverse navigation tasks: poor generalization to unseen tasks and weak long-horizon reasoning. For tasks that require a continuous sequence of operations, existing methods for modeling historical information are inadequate, impairing the agent's ability to maintain coherent decision-making across extended interactions. To overcome these limitations, we argue that an agent, much like a human, must be able both to efficiently summarize past experience and to reason clearly on top of that summary. We therefore propose a reasoning-enhanced framework for GUI navigation. Rather than reviewing the history in a single pass, the framework establishes a dynamic, cyclical reasoning mechanism: at each decision step, the agent compresses the entire interaction trajectory up to that point into a concise history summary, and this newly generated summary then serves as the core context fed into the structured reasoning module at the next step.
Overview
We propose GUI-Rise, a reasoning-augmented agent for GUI navigation. At its core is a three-subtask framework that emulates the human "think-act-summarize" decision-making process, so that the agent grounds each decision in sufficient historical information. During each interaction with the GUI environment, GUI-Rise sequentially executes the following three tightly coupled subtasks (a minimal sketch of the loop follows the list):
- Structured Reasoning Subtask: The agent first evaluates the current task progress to determine the next course of action.
- Action Prediction Subtask: Based on the outcome of the reasoning stage, it then outputs an action in the specified format.
- History Summarization Subtask: Finally, it compresses the executed operation and the corresponding GUI state into a compact and coherent representation for future reference.
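
The sketch below illustrates this per-step loop under simplifying assumptions: a hypothetical `model.generate`/`env.step` interface and `<think>`/`<action>`/`<summary>` output tags. All names are illustrative, not the released implementation.

```python
import re


def extract_tag(text: str, tag: str) -> str:
    """Pull the content of <tag>...</tag> from the model output (tag names are assumed)."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def run_episode(model, env, instruction: str, max_steps: int = 15) -> str:
    """One navigation episode following the think-act-summarize loop."""
    history_summary = ""  # compact summary carried forward across steps
    observation = env.reset(instruction)
    for _ in range(max_steps):
        # The prompt bundles the instruction, the current screenshot, and the
        # summary produced at the previous step.
        prompt = {
            "instruction": instruction,
            "screenshot": observation["screenshot"],
            "history_summary": history_summary,
        }
        output = model.generate(prompt)
        reasoning = extract_tag(output, "think")          # structured reasoning subtask (kept for logging)
        action = extract_tag(output, "action")            # action prediction subtask
        history_summary = extract_tag(output, "summary")  # history summarization subtask
        observation, done = env.step(action)
        if done:
            break
    return history_summary
```

The key design point is that the summary produced at step t replaces the raw trajectory as the history context at step t+1, keeping the prompt length roughly constant over long episodes.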
Training
To enable this structured reasoning capability, we develop a two-stage training paradigm. The first stage bootstraps the model with supervised fine-tuning on a small set of pseudo-labeled trajectories to establish basic reasoning and summarization skills, while the second stage applies reinforcement learning in a simulated GUI environment to refine task-specific reasoning strategies through interaction.
We employ Group Relative Policy Optimization (GRPO) with three complementary reward functions during the reinforcement learning stage (sketched after the list):
- Format Reward: Enforcing structured outputs by checking that the reasoning, action, and summary components appear within their designated semantic tags.
- Action Accuracy Reward: Evaluating the accuracy and consistency of predicted actions.
- History Summarization Reward: Assessing the history summary's quality for future decisions.
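
A minimal sketch of how these rewards could be combined and normalized group-relatively under GRPO follows. The reward definitions, the equal weights, and the exact-match action check are simplifying assumptions, not the paper's exact formulation.

```python
import re
from statistics import mean, pstdev


def format_reward(output: str) -> float:
    """1.0 when the output contains all three tagged components; tag names are assumed."""
    return float(all(re.search(rf"<{t}>.*?</{t}>", output, re.DOTALL)
                     for t in ("think", "action", "summary")))


def action_accuracy_reward(pred_action: str, gold_action: str) -> float:
    """Simplified exact-match check; the benchmark metrics are finer grained."""
    return float(pred_action.strip() == gold_action.strip())


def history_summary_reward(next_step_correct: bool) -> float:
    """History-aware objective: credit the summary when the next step's action,
    predicted from that summary, turns out to be correct."""
    return float(next_step_correct)


def combined_reward(output: str, pred_action: str, gold_action: str,
                    next_step_correct: bool,
                    weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three rewards; the equal weights are an assumption."""
    w_fmt, w_act, w_hist = weights
    return (w_fmt * format_reward(output)
            + w_act * action_accuracy_reward(pred_action, gold_action)
            + w_hist * history_summary_reward(next_step_correct))


def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantage: normalize each sampled rollout's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In practice the action reward would use the benchmark's element- and argument-matching criteria rather than string equality; the history reward is what ties summary quality to subsequent action performance, as described above.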
Experiments
Out-of-domain evaluation results on the Mind2Web benchmark. The table reports performance under two settings: (1) the standard setting, where models are trained on the Mind2Web training set and evaluated on its test set; and (2) the zero-shot setting, where models are trained on the GUIAct training set and evaluated on the Mind2Web test set.
Evaluation results on the AITW benchmark. The table reports performance under two settings: (1) the in-domain setting, where models are trained on the AITW training set and evaluated on its test set; and (2) the zero-shot setting, where models are trained on the GUIAct training set and evaluated on the AITW test set.
Success rate (%) on online benchmarks. MiniWob (ZS) follows the zero-shot setting, MiniWob (FT) follows the fine-tuning setting, and * indicates results evaluated on the chrome-split of OSWorld.
To comprehensively evaluate GUI-Rise, we design evaluation criteria from two perspectives: (1) Out-of-domain Evaluation. We use the GUIAct benchmark as the training set and evaluate on offline benchmarks such as AITW and Mind2Web, as well as online benchmarks including MiniWob, OSWorld, and AndroidWorld. (2) In-domain Evaluation. We use the AITW, Mind2Web, and MiniWob training sets and evaluate on their respective test sets.
Ablation study of GUI-Rise on the AITW benchmark. "TST" denotes the use of two-stage training with RL; "SCoT" indicates the incorporation of structured Chain-of-Thought reasoning; "HS" means using history summary as input; "HSR" refers to applying the history summary reward during RL.
Ablation Studies. We report ablation results for GUI-Rise-2B on the AITW benchmark, assessing the effects of two-stage training, structured reasoning, and history summarization.
Visualization
A case study from the Mind2Web dataset illustrates detailed reasoning across two consecutive steps.