gWorld: Generative Visual Code Mobile World Models

2026.02.06
Web · by 이호민
#LLM · #VLM · #World Models · #GUI Agent · #Code Generation

Key Points

  • The paper proposes gWorld, a novel visual mobile GUI World Model paradigm that predicts future GUI states by generating renderable web code, effectively combining the linguistic precision of text-based models with the visual fidelity of pixel-based approaches.
  • This generative visual code approach virtually eliminates structural errors (<1% Render Fail) and offers a simplified pipeline compared to previous visual models that relied on numerous external components for text rendering.
  • gWorld (8B, 32B) establishes a new Pareto frontier in accuracy versus model size on the MWMBench benchmark, significantly outperforming much larger frontier open-weight models while demonstrating predictable performance gains with increased training data.

The paper introduces a novel paradigm for Mobile Graphical User Interface (GUI) World Models (WMs), addressing the limitations of existing approaches. Traditional text-based WMs sacrifice visual fidelity, while visual WMs struggle with precise text rendering and often rely on complex, multi-model pipelines.

The core innovation is visual world modeling via renderable code generation. Instead of directly generating pixels or relying solely on text, a single Vision-Language Model (VLM) is trained to predict the next GUI state as executable web code: a structured representation that can be rendered to pixels. This approach leverages the VLM's linguistic priors for accurate text rendering and its pre-training on structured web code for high-fidelity visual generation, effectively combining the strengths of both text-based and visual approaches.
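A minimal sketch of this prediction loop, assuming the VLM emits HTML (the paper says only "renderable web code"): `predict_next_state` is a hypothetical stand-in for the trained model, and the tag-balance check is a crude stdlib proxy for whether the output would render at all.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

@dataclass
class Action:
    kind: str          # e.g. "click" or "type"
    x: int = 0
    y: int = 0
    text: str = ""

def predict_next_state(screenshot_path: str, action: Action) -> str:
    """Stand-in for the VLM: given the current screenshot and an action,
    return the next GUI state as renderable web code (HTML here)."""
    # A real system would call the trained model; we return a fixed page.
    return (
        "<html><body>"
        f"<div class='toast'>Performed {action.kind} at ({action.x}, {action.y})</div>"
        "<button id='ok'>OK</button>"
        "</body></html>"
    )

class _TagBalanceChecker(HTMLParser):
    """Minimal structural check: every opened tag must be closed in order."""
    VOID = {"br", "img", "input", "meta", "hr", "link"}

    def __init__(self):
        super().__init__()
        self.stack, self.ok = [], True

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if not self.stack or self.stack.pop() != tag:
            self.ok = False

def render_fails(code: str) -> bool:
    """Crude proxy for the paper's Render Fail metric: True if the
    generated code is structurally broken."""
    checker = _TagBalanceChecker()
    checker.feed(code)
    return not (checker.ok and not checker.stack)

html_out = predict_next_state("screen_t.png", Action("click", 540, 1200))
print(render_fails(html_out))  # False: the code is structurally sound
```

The structural check here is far weaker than actually rendering the page, but it illustrates why code output is easy to validate: well-formedness is machine-checkable before any pixels are drawn.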

The proposed models, gWorld (8B, 32B), are the first open-weight visual mobile GUI WMs built on this paradigm.

Core Methodology - Data Generation Pipeline:
The paper details a three-step data generation framework to synthesize code-based training data:

  1. Repurposing Policy Trajectories: Existing offline policy trajectories, typically stored as current GUI states (S_t) and actions (A_t), are repurposed into world-modeling triplets {S_t, A_t, S_{t+1}}. Here, S_t and S_{t+1} are the visual GUI states (screenshots), and A_t is the action performed (e.g., click coordinates, text input).
  2. Synthetic Cross-modal Re-labeling: This is the critical step for generating the target code. The ground-truth next state S_{t+1} (in its pixel/visual form) is converted into a renderable web-code representation, denoted C_{t+1}. This "cross-modal re-labeling" is performed by a frontier VLM, which analyzes the visual S_{t+1} and translates its content and layout into a structured, renderable code format. The generated code C_{t+1} then serves as the target output during training: gWorld learns to generate it given S_t and A_t.
  3. Reasoning Data with Look-ahead: To enhance the VLM's predictive capabilities, reasoning traces (R_t) are synthesized. With access to the ground-truth target state (either S_{t+1} or its code representation C_{t+1}), the system can generate rationales for *why* a particular next state follows from the current state and action. This look-ahead allows the creation of richer training data that guides the VLM toward the causal relationship between actions and state transitions. The VLM is ultimately trained to generate C_{t+1} (and optionally R_t) conditioned on S_t and A_t.
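The first two steps above can be sketched as follows. `relabel_to_code` is a hypothetical placeholder for the frontier-VLM re-labeling call, and the sample layout is illustrative rather than the paper's actual schema.

```python
def relabel_to_code(state_image: str) -> str:
    """Placeholder for cross-modal re-labeling: a real pipeline would
    prompt a frontier VLM with the screenshot and get back renderable
    web code describing its content and layout."""
    return f"<html><body><!-- layout of {state_image} --></body></html>"

def build_world_model_samples(trajectory):
    """Steps 1 and 2: turn a policy trajectory [(S_0, A_0), (S_1, A_1), ...]
    into (S_t, A_t, C_{t+1}) training samples by pairing consecutive states
    and re-labeling the next state into code."""
    samples = []
    for (s_t, a_t), (s_next, _) in zip(trajectory, trajectory[1:]):
        samples.append({
            "state": s_t,                             # current screenshot
            "action": a_t,                            # e.g. click coordinates
            "target_code": relabel_to_code(s_next),   # C_{t+1}
        })
    return samples

traj = [
    ("s0.png", {"type": "click", "x": 120, "y": 640}),
    ("s1.png", {"type": "type", "text": "hello"}),
    ("s2.png", None),  # terminal state: no action follows
]
samples = build_world_model_samples(traj)
print(len(samples))  # 2 triplets from a 3-state trajectory
```

Step 3 would extend each sample with a synthesized reasoning trace R_t generated while the ground-truth target is still visible to the labeling model.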

Evaluation and Results:
The paper introduces MWMBench (Mobile World Model Bench), a comprehensive benchmark for evaluating mobile GUI world models. It features:

  • Evaluation in the native visual modality.
  • Real-world coordinate-based action spaces.
  • In-distribution (AitW, GUIOdyssey, AndroidControl, AMEX) and out-of-distribution (AndroidWorld, KApps) datasets.

Performance is measured using three key metrics:

  • Instruction Accuracy (IAcc.): The percentage of predictions whose next state matches the ground-truth state.
  • Render Fail: The percentage of generated outputs that cannot be successfully rendered into a visual GUI state.
  • Similarity: A metric quantifying the visual resemblance between the rendered predicted state and the ground-truth state.
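A hedged sketch of how these three metrics might be aggregated over an evaluation run. The record fields below are illustrative, not MWMBench's actual schema; note that visual similarity can only be computed for outputs that rendered.

```python
def summarize_predictions(records):
    """Aggregate three MWMBench-style metrics over evaluation records.
    Each record: {"correct": bool, "rendered": bool, "similarity": float|None}.
    """
    n = len(records)
    iacc = sum(r["correct"] for r in records) / n
    render_fail = sum(not r["rendered"] for r in records) / n
    # Similarity is only defined where the output actually rendered.
    sims = [r["similarity"] for r in records if r["rendered"]]
    similarity = sum(sims) / len(sims)
    return {"IAcc": iacc, "RenderFail": render_fail, "Similarity": similarity}

records = [
    {"correct": True,  "rendered": True,  "similarity": 0.81},
    {"correct": False, "rendered": True,  "similarity": 0.55},
    {"correct": True,  "rendered": False, "similarity": None},
    {"correct": True,  "rendered": True,  "similarity": 0.74},
]
metrics = summarize_predictions(records)
print(metrics)  # IAcc 0.75, RenderFail 0.25, Similarity 0.70
```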

Key Findings:

  • gWorld models establish a new Pareto frontier in accuracy versus model size. The 8B and 32B gWorld models significantly outperform 8 frontier open-weight models, some of which are up to 50.25 times larger (e.g., Llama 4 402B).
  • gWorld 8B achieves an average IAcc. of 74.9%, and gWorld 32B achieves 79.6%, far surpassing models like Qwen3 VL 32B (52.5%) and even GLM-4.6V 106B (67.4%).
  • The code-based approach virtually eliminates structural errors, achieving Render Fail rates of less than 1% (0.6% for gWorld 32B), a drastic improvement over other VLMs (e.g., Qwen3 VL 8B at 40.1%, Llama 4 402B at 9.2%). This indicates the generated code is highly renderable and structurally sound.
  • gWorld models also maintain competitive visual similarity scores (71.4% for gWorld 32B).
  • Scaling training data yields predictable gains following a power law (R² ≥ 0.94), suggesting non-saturating improvements with more data.
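A power-law claim of this kind is typically verified by a least-squares fit in log-log space, where y = a·x^b becomes a straight line. The sketch below uses synthetic numbers, not the paper's actual curves.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x^b by linear least squares in log-log space
    and report the R^2 of the log-space fit."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    pred = [math.log(a) + b * x for x in lx]
    ss_res = sum((y - p) ** 2 for y, p in zip(ly, pred))
    ss_tot = sum((y - my) ** 2 for y in ly)
    return a, b, 1 - ss_res / ss_tot

# Synthetic example: prediction error shrinks as a power of dataset size.
sizes = [1e4, 3e4, 1e5, 3e5, 1e6]
errors = [0.40, 0.31, 0.23, 0.18, 0.13]
a, b, r2 = fit_power_law(sizes, errors)
print(f"exponent b = {b:.2f}, R^2 = {r2:.3f}")
```

A high R² on such a fit, as the paper reports across its data-scaling runs, means the log-log points lie nearly on a line, i.e. doubling the data buys a predictable fractional improvement.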

In essence, gWorld demonstrates that generating renderable code for GUI state prediction offers a superior trade-off, enabling precise visual fidelity, robust structural integrity, and competitive performance with significantly smaller model sizes.