AsyncWebRL

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

1 UIUC 2 Microsoft 3 CMU


Abstract

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations — an everlasting rollout pool and lightweight screenshot handling — that together deliver up to a 2.9× end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer $1/|\tau_i|$ in multi-step GRPO as the root cause of trajectory- and token-level inefficiency: failures are systematically longer than successes, so it down-weights the negative gradient on failed tokens, and the policy keeps producing verbose memory schemas. Replacing $1/|\tau_i|$ with a constant $1/k$ breaks this coupling, contracting trajectories while preserving aggregate success. Together these contributions set a new open-source state of the art on the WebGym OOD test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% Medium, +48% Hard).

Two Contributions, One Framework

System

The first open multi-step RL framework for visual web agents that is fully async end-to-end. Overlaps rollout, gradient update, and policy refresh across iteration boundaries with an everlasting rollout pool (no warm-up bubble at every iteration boundary) and lightweight screenshot handling (image tensors stay in a dedicated in-memory actor; only references travel through RPC).
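The overlap described above can be sketched with a small `asyncio` toy. This is only an illustration under stated assumptions, not the framework's code: `rollout_worker`, `learner`, the `policy` dictionary, and the sleep durations are hypothetical stand-ins for browser rollouts, gradient updates, and weight broadcasts. The point it shows is that the workers are started once and keep producing while the learner updates and refreshes the policy.

```python
import asyncio


async def rollout_worker(worker_id, queue, policy, stop):
    """Long-lived worker: started once, never torn down between iterations."""
    while not stop.is_set():
        version = policy["version"]        # weights this rollout is generated with
        await asyncio.sleep(0.001)         # stand-in for interacting with a browser
        await queue.put({"worker": worker_id, "behavior_version": version})


async def learner(queue, policy, num_updates, batch_size):
    """Consumes trajectories while the workers keep producing new ones."""
    lags = []
    for _ in range(num_updates):
        batch = [await queue.get() for _ in range(batch_size)]
        lags.extend(policy["version"] - t["behavior_version"] for t in batch)
        await asyncio.sleep(0.002)         # stand-in for the gradient update
        policy["version"] += 1             # policy refresh, visible to all workers
    return lags


async def main():
    queue = asyncio.Queue(maxsize=16)
    policy = {"version": 0}
    stop = asyncio.Event()
    workers = [asyncio.create_task(rollout_worker(i, queue, policy, stop))
               for i in range(4)]
    lags = await learner(queue, policy, num_updates=5, batch_size=4)
    stop.set()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return policy["version"], lags


final_version, lags = asyncio.run(main())
```

Each entry of `lags` records how many policy versions behind a trajectory was when the learner consumed it, which is the off-policyness that the algorithmic side has to tolerate.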

Algorithm: decoupled loss + constant $1/k$

Two changes to the loss matter equally for multi-step agentic RL. (1) A decoupled importance-sampling ratio centers PPO clipping on a proximal policy, so the stale off-policy rollouts that async training produces stay usable — halving the clip-trigger rate. (2) A constant $1/k$ step normalizer replaces the per-trajectory $1/|\tau_i|$, the root cause of trajectory- and token-level inefficiency: failures average 12.5 steps vs. 5.1 for successes, so $1/|\tau_i|$ attenuates the negative gradient by ~2.4×. The constant contracts trajectories while preserving success, with the largest gains on Medium / Hard OOD.

System


Async architecture
Multi-step asynchronous management. Colored blocks are concurrent rollout workers, gradient updates on $\pi_t$, and policy refreshes that broadcast new weights to workers. White gaps under sync RL (top) are bubble time. AsyncWebRL (bottom) eliminates these gaps by maintaining an everlasting rollout pool so rollout, update, and refresh overlap continuously.
Screenshot management
Lightweight screenshot handling. WebGym (left) serializes every high-resolution screenshot through the shared RPC object store and spills to disk under concurrent rollouts. AsyncWebRL (right) keeps all image tensors in a dedicated in-memory actor and routes only lightweight references through RPC.
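A minimal sketch of the reference-passing idea, assuming a single in-process store in place of the dedicated actor. `ScreenshotStore` and `ScreenshotRef` are hypothetical names introduced for illustration; the real system's interfaces are not described here.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenshotRef:
    """Lightweight handle that would travel through RPC instead of the image."""
    key: int


class ScreenshotStore:
    """Hypothetical in-memory owner of screenshot payloads."""

    def __init__(self):
        self._images = {}
        self._next_key = itertools.count()

    def put(self, image_bytes: bytes) -> ScreenshotRef:
        ref = ScreenshotRef(next(self._next_key))
        self._images[ref.key] = image_bytes
        return ref

    def get(self, ref: ScreenshotRef) -> bytes:
        return self._images[ref.key]

    def release(self, ref: ScreenshotRef) -> None:
        # Drop the payload once no trajectory needs it any more.
        self._images.pop(ref.key, None)


store = ScreenshotStore()
screenshot = bytes(1920 * 1080 * 3)   # placeholder for one raw RGB frame
ref = store.put(screenshot)

# Only `ref` (a small integer wrapper) is passed between components;
# the multi-megabyte payload is never copied or serialized.
same_object = store.get(ref) is screenshot
```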
Training throughput. Left: instantaneous throughput (trajectories per hour). Right: cumulative trajectories under a 24-hour training budget; the async lines pull ~40k trajectories ahead of sync by day's end. Below: wall-clock time to collect 1,000 trajectories. AsyncWebRL produces ~3,100 traj/h on both Instruct and Thinking vs. ~1,300 / ~1,050 for sync WebGym, a 2.4–2.9× end-to-end speedup.
Off-policyness during GRPO training. With max staleness $\eta = 2$, the mean per-token off-policy gap stays near 1.5 and the max near 2.0, well below the cap (dashed).
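One plausible way to enforce a staleness cap is sketched below. It assumes staleness is measured as the number of policy versions between the learner and the policy that generated a rollout; that definition, the `Trajectory` record, and the `admit` helper are assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_STALENESS = 2  # eta in the text


@dataclass
class Trajectory:
    task_id: str
    behavior_version: int  # policy version that generated this rollout


def staleness(traj, learner_version):
    """Version lag between the learner and the rollout's behavior policy."""
    return learner_version - traj.behavior_version


def admit(trajs, learner_version, max_staleness=MAX_STALENESS):
    """Keep only rollouts whose version lag is within the cap."""
    return [t for t in trajs
            if 0 <= staleness(t, learner_version) <= max_staleness]


pool = [Trajectory("a", 7), Trajectory("b", 9), Trajectory("c", 10)]
usable = admit(pool, learner_version=10)
usable_ids = [t.task_id for t in usable]   # "a" lags by 3 versions and is dropped
```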

Algorithm

If you build multi-step agentic RL, two changes to the loss matter equally: (1) a decoupled importance-sampling ratio — clip on a proximal policy so the stale off-policy rollouts of async training remain usable (clip-trigger rate roughly halved), and (2) a constant $1/k$ step normalizer in place of the per-trajectory $1/|\tau_i|$ — so the loss stops under-weighting long failures and trajectories contract at matched success. Neither alone is enough.

From vanilla PPO to the AsyncWebRL loss, in five steps.

Vanilla PPO
Standard policy-gradient surrogate with clipped trust region
\[\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{t}\Bigg[\min\!\Big(\tfrac{\pi_\theta}{\pi_{\text{old}}}\hat{A}_t,\ \text{clip}\!\Big(\tfrac{\pi_\theta}{\pi_{\text{old}}},1{-}\epsilon,1{+}\epsilon\Big)\hat{A}_t\Big)\Bigg]\]
Standard PPO: maximize expected advantage of the new policy, with a clipped trust region around the old policy. Single-turn, no group structure.
\[\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\Bigg[\tfrac{1}{G}\sum_{i=1}^{G}\min\!\Big(\tfrac{\pi_\theta}{\pi_{\text{old}}}\hat{A}_i,\ \text{clip}\!\Big(\tfrac{\pi_\theta}{\pi_{\text{old}}},1{-}\epsilon,1{+}\epsilon\Big)\hat{A}_i\Big)\Bigg]\]
Change from previous: add the group-relative, critic-free advantage. Roll out $G$ trajectories per task and use $\hat{A}_i = (r_i - \mathrm{mean}(\mathbf{r}))/\mathrm{std}(\mathbf{r})$, where $\mathbf{r} = (r_1, \dots, r_G)$ collects the group's rewards. The objective now averages over the group through $\tfrac{1}{G}\sum_{i=1}^{G}$.
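The group-relative advantage can be computed in a few lines. This sketch assumes the population standard deviation and a small `eps` in the denominator to guard against a group whose rewards are all equal; neither choice is specified above.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / std(r), computed over one task's G rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)          # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task with binary success rewards.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout gets the same reward, the group carries no learning signal.
flat = group_advantages([1.0, 1.0, 1.0])
```

With two successes and two failures the advantages come out close to +1 for successes and -1 for failures, and a group with identical rewards yields all zeros.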
\[\mathcal{J}(\theta) = \mathbb{E}\Bigg[\tfrac{1}{G\,|\tau_i|}\sum_{i=1}^{G}\sum_{j=1}^{|\tau_i|}\sum_{t=1}^{|\tau_{i,j}|}\min\!\Big(\tfrac{\pi_\theta}{\pi_{\text{old}}}\hat{A}_i,\ \text{clip}\!\Big(\tfrac{\pi_\theta}{\pi_{\text{old}}},1{-}\epsilon,1{+}\epsilon\Big)\hat{A}_i\Big)\Bigg]\]
Change from previous: the multi-step extension puts $|\tau_i|$ into the outer normalizer and adds two sums, $\sum_j$ over the steps of trajectory $\tau_i$ and $\sum_t$ over the tokens of step $\tau_{i,j}$. Failures average 12.5 steps vs. 5.1 for successes, so $1/|\tau_i|$ attenuates the gradient on long failures by ~2.4×. This is the issue we identify.
\[\mathcal{J}(\theta) = \mathbb{E}\Bigg[\tfrac{1}{G\cdot k}\sum_{i=1}^{G}\sum_{j=1}^{|\tau_i|}\sum_{t=1}^{|\tau_{i,j}|}\min\!\Big(\tfrac{\pi_\theta}{\pi_{\text{old}}}\hat{A}_i,\ \text{clip}\!\Big(\tfrac{\pi_\theta}{\pi_{\text{old}}},1{-}\epsilon,1{+}\epsilon\Big)\hat{A}_i\Big)\Bigg]\]
Change from previous: $|\tau_i|$ in the outer normalizer is replaced by a constant $k$ (the Easy-difficulty horizon, $k{=}10$). One-line fix; restores full per-token gradient weight on long failures, contracting trajectories at matched test reward.
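The effect of the swap on per-token weight can be checked numerically. The sketch below uses illustrative lengths of 5 and 12 steps (close to the reported averages) and a group size of 4; these values and the helper `per_token_weight` are assumptions for the example.

```python
def per_token_weight(num_steps, group_size, k=None):
    """Weight applied to each token's clipped term in the objective.

    k=None  -> per-trajectory normalizer 1 / (G * |tau_i|)
    k=const -> constant normalizer       1 / (G * k)
    """
    denom = num_steps if k is None else k
    return 1.0 / (group_size * denom)


G = 4
success_steps, failure_steps = 5, 12   # illustrative trajectory lengths

# How much more weight a success token gets than a failure token.
length_norm_ratio = (per_token_weight(success_steps, G)
                     / per_token_weight(failure_steps, G))
constant_ratio = (per_token_weight(success_steps, G, k=10)
                  / per_token_weight(failure_steps, G, k=10))
```

Under $1/|\tau_i|$ a token in the 12-step failure gets 2.4 times less weight than a token in the 5-step success; under the constant $1/k$ the two weights are equal.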
\[\mathcal{J}(\theta) = \mathbb{E}_{\tau\sim\pi_{\text{behave}}}\Bigg[\tfrac{1}{G\cdot k}\sum_{i=1}^{G}\sum_{j=1}^{|\tau_i|}\sum_{t=1}^{|\tau_{i,j}|}\min\!\Big(\tfrac{\pi_\theta}{\pi_{\text{behave}}}\hat{A}_i,\ \tfrac{\pi_{\text{prox}}}{\pi_{\text{behave}}}\,\text{clip}\!\Big(\tfrac{\pi_\theta}{\pi_{\text{prox}}},1{-}\epsilon,1{+}\epsilon\Big)\hat{A}_i\Big)\Bigg]\]
Change from previous: for async RL, trajectories are collected by a stale behavior policy $\pi_{\text{behave}}$, not $\pi_\theta$. We factor $\pi_\theta/\pi_{\text{behave}} = (\pi_\theta/\pi_{\text{prox}}) \cdot (\pi_{\text{prox}}/\pi_{\text{behave}})$ and center PPO-style clipping on a proximal policy $\pi_{\text{prox}}\!\approx\!\pi_\theta$. This roughly halves the clip-trigger rate. This is Eq. 1 of the paper.
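As an algebraic check on this last step (a reader's sketch, not a derivation taken from the paper): because $\pi_{\text{prox}}/\pi_{\text{behave}} > 0$, it factors out of the $\min$, so each token's term is an ordinary clipped surrogate centered on $\pi_{\text{prox}}$, reweighted by a fixed importance weight.

\[
\min\!\Big(\tfrac{\pi_\theta}{\pi_{\text{behave}}}\hat{A}_i,\ \tfrac{\pi_{\text{prox}}}{\pi_{\text{behave}}}\,\text{clip}\!\Big(\tfrac{\pi_\theta}{\pi_{\text{prox}}},1{-}\epsilon,1{+}\epsilon\Big)\hat{A}_i\Big)
= \tfrac{\pi_{\text{prox}}}{\pi_{\text{behave}}}\,\min\!\Big(\tfrac{\pi_\theta}{\pi_{\text{prox}}}\hat{A}_i,\ \text{clip}\!\Big(\tfrac{\pi_\theta}{\pi_{\text{prox}}},1{-}\epsilon,1{+}\epsilon\Big)\hat{A}_i\Big)
\]

Setting $\pi_{\text{prox}} = \pi_{\text{behave}}$ makes the leading weight equal to 1 and recovers the previous objective with $\pi_{\text{old}} = \pi_{\text{behave}}$.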

Why the Decoupled Loss Matters

Async RL learns from trajectories collected by a stale behavior policy $\pi_{\text{behave}}$, not the current $\pi_\theta$. A naive (coupled) importance ratio $\pi_\theta/\pi_{\text{behave}}$ drifts far from 1 under this off-policyness, so PPO's clip fires constantly and throws away the very samples async is producing — the throughput the system buys gets wasted on clipped gradients. Decoupling factors the ratio through a proximal policy $\pi_{\text{prox}}\!\approx\!\pi_\theta$ and centers clipping there, keeping stale rollouts usable. It is what makes multi-step async RL actually learn, and it is one of our two equally-important algorithmic ingredients alongside the constant $1/k$ normalizer.
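A single-token numerical example illustrates the difference. The probabilities below are made up for illustration, and `coupled_term` and `decoupled_term` are hypothetical helper names; each returns the surrogate value and whether the clipped branch was selected.

```python
import math


def clip(x, lo, hi):
    return max(lo, min(x, hi))


def coupled_term(logp_new, logp_behave, adv, eps=0.2):
    """PPO term with the ratio taken against the stale behavior policy."""
    ratio = math.exp(logp_new - logp_behave)
    unclipped = ratio * adv
    clipped = clip(ratio, 1 - eps, 1 + eps) * adv
    return min(unclipped, clipped), clipped < unclipped


def decoupled_term(logp_new, logp_prox, logp_behave, adv, eps=0.2):
    """Clipping is centered on pi_prox; pi_prox/pi_behave is a fixed weight."""
    weight = math.exp(logp_prox - logp_behave)     # pi_prox / pi_behave
    ratio = math.exp(logp_new - logp_prox)         # pi_theta / pi_prox
    unclipped = weight * ratio * adv
    clipped = weight * clip(ratio, 1 - eps, 1 + eps) * adv
    return min(unclipped, clipped), clipped < unclipped


# A stale rollout: the behavior policy gave this token probability 0.20,
# the proximal policy 0.30, and the current policy 0.31.
lp_behave, lp_prox, lp_new = math.log(0.20), math.log(0.30), math.log(0.31)

coupled_value, coupled_clipped = coupled_term(lp_new, lp_behave, adv=1.0)
decoupled_value, decoupled_clipped = decoupled_term(lp_new, lp_prox, lp_behave, adv=1.0)
```

Here the coupled ratio is 1.55, so the clip fires and the token contributes no gradient, while the decoupled ratio is about 1.03 and the token stays inside the trust region.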

Figure: Coupled vs. decoupled importance sampling (panels: training reward, $\epsilon$-clip fraction, dual-clip fraction; curves: decoupled loss, coupled loss). Decoupling roughly halves the clip-trigger rate and speeds reward improvement.

Emergent Behaviors of Agent Memory Under Step-Number Normalization (And Why You Shouldn't Use It)

At every step the WebGym agent writes a free-form, append-only Memory JSON — running notes it keeps about the task.

Screenshot (observation) → Qwen3-VL (agent) → Action (click · type · scroll)
The agent's full response at one step, in memory-additive format (Memory · Progress · Intention · Action · tool_call):
Memory: {
  "task": "Find a vegetarian dinner recipe with a photo and an ingredient list",
  "current_url": "https://www.foodnetwork.com/search/vegetarian-dinner",
  "candidate_recipe": "Baked Penne with Roasted Vegetables"
}
Progress: {
  "Open foodnetwork.com": "finished",
  "Search for a vegetarian dinner": "finished",
  "Open a candidate recipe": "not finished",
  "Verify photo + ingredient list": "not finished"
}
Intention: "Open a candidate recipe"
Action: Click the "Baked Penne with Roasted Vegetables" search result.
<tool_call>{"name": "click", "arguments": {"element": "Baked Penne with Roasted Vegetables"}}</tool_call>
At each step the policy conditions on its previous response (including the Memory block) plus the latest screenshot, and may only append to Memory, never edit earlier entries. So the notes accumulate over the rollout, and what kind of keys the policy tends to append is itself a learned behavior.
The per-step loop. Because Memory is append-only and rides along in the agent's context, it accumulates across a rollout — which makes it a clean place to read off how RL has changed the agent's behavior.

The designed memory mechanism, before RL (illustrative)

One rollout on the task “Find a vegetarian dinner recipe that includes a photo and an ingredient list.” The agent reads each screenshot, acts, and appends to Memory — which only ever grows.

1. Open foodnetwork.com
Page: foodnetwork.com (search box and image)
Memory: {
  "task": "Find a vegetarian dinner recipe with a photo and ingredients"
}
2. Type “vegetarian dinner” into search
Page: foodnetwork.com/search (search box: vegetarian dinner)
Memory: {
  "task": "Find a vegetarian dinner recipe with a photo and ingredients",
  "search_query": "vegetarian dinner"
}
3. Click “Baked Penne with Roasted Vegetables”
Page: foodnetwork.com/search?q=vegetarian+dinner (search box: vegetarian dinner; clicked result: Baked Penne with Roasted Vegetables)
Memory: {
  "task": "Find a vegetarian dinner recipe with a photo and ingredients",
  "search_query": "vegetarian dinner",
  "candidate_recipe": "Baked Penne with Roasted Vegetables"
}
4. Confirm photo + ingredient list present → submit
Page: foodnetwork.com/recipes/baked-penne (title: Baked Penne with Roasted Vegetables; recipe photo; ingredients: penne pasta, roasted vegetables, marinara sauce)
Memory: {
  "task": "Find a vegetarian dinner recipe with a photo and ingredients",
  "search_query": "vegetarian dinner",
  "candidate_recipe": "Baked Penne with Roasted Vegetables",
  "photo_and_ingredients_present": true
}
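The append-only rule can be checked mechanically on the four Memory snapshots of this rollout. The helper `is_append_only` is a hypothetical name introduced for illustration; it only assumes each Memory is a flat JSON object.

```python
def is_append_only(prev, new):
    """True if every earlier key is still present with an unchanged value."""
    return all(key in new and new[key] == value for key, value in prev.items())


TASK = "Find a vegetarian dinner recipe with a photo and ingredients"
snapshots = [
    {"task": TASK},
    {"task": TASK, "search_query": "vegetarian dinner"},
    {"task": TASK, "search_query": "vegetarian dinner",
     "candidate_recipe": "Baked Penne with Roasted Vegetables"},
    {"task": TASK, "search_query": "vegetarian dinner",
     "candidate_recipe": "Baked Penne with Roasted Vegetables",
     "photo_and_ingredients_present": True},
]

# Memory grows by exactly one key per step in this rollout.
keys_per_step = [len(memory) for memory in snapshots]

# Every consecutive pair of snapshots respects the append-only rule.
all_append_only = all(is_append_only(a, b)
                      for a, b in zip(snapshots, snapshots[1:]))

# Rewriting an earlier entry would violate it.
edited = dict(snapshots[1], search_query="vegan dinner")
edit_detected = not is_append_only(snapshots[1], edited)
```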

How RL Changes the Agent's Memory Behavior

RL never edits Memory directly; it only updates the policy by gradient descent. But that changes the agent's behavior, and a changed behavior produces Memory with very different properties. The mechanism: in our setting failure is dominated by horizon exhaustion (running out of steps), not clearly wrong actions, and failed rollouts are far longer than successful ones (12.5 vs. 5.1 steps on average). Since $1/|\tau_i|$ gives every trajectory the same total weight regardless of length, each token in a long failure carries only a $1/|\tau_i|$ share, which attenuates the penalty by $\approx\!2.4\times$ on exactly the rollouts the policy should learn to avoid. The three cases below compare Memory at step 4 of one representative rollout each.
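A short worked calculation makes the "same total weight" statement concrete. It treats each step as one unit (ignoring per-step token counts), uses the reported average lengths as representative trajectories, and takes $k = 10$; these simplifications are assumptions made for illustration.

\[
\sum_{j=1}^{|\tau_i|}\frac{1}{G\,|\tau_i|} = \frac{1}{G}
\qquad\text{vs.}\qquad
\sum_{j=1}^{|\tau_i|}\frac{1}{G\,k} = \frac{|\tau_i|}{G\,k}
\]

\[
\text{failure: } \frac{12.5}{10\,G} = \frac{1.25}{G},
\qquad
\text{success: } \frac{5.1}{10\,G} = \frac{0.51}{G}
\]

Under $1/|\tau_i|$ a long failure and a short success contribute the same total weight $1/G$; under the constant $1/k$ the total grows with length, so the long failure is no longer discounted.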

The surprise: we never change the token-level loss. The only knob we touch is the per-trajectory step-count normalizer — $1/|\tau_i| \rightarrow 1/k$, a weight over whole trajectories, identical for every token within one. Yet the token-level behavior — what the policy actually writes, token by token, into Memory — diverges sharply. A purely trajectory-level reweighting silently reshapes token-level generation.

Base (no RL). The pre-RL starting point: the policy emits a compact Memory consisting of a site pointer, the task, and a short note. We show it only as the reference the two RL runs depart from; on its own it carries no claim. The signal is the difference between the two RL runs that follow.


GRPO with $1/|\tau_i|$ (length norm). Trained with the per-trajectory normalizer, the agent's behavior shifts toward appending a fresh generic slot almost every step — current_page, search_term, search_result, recipe_title, recipe_page, … Across these runs 34% of all keys are generic placeholders and only 7% of trajectories keep their key set to the end. Because the loss barely penalizes the long failed rollouts, padding Memory is nearly free — so the behavior emerges.
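A statistic like the generic-placeholder share could be computed along these lines. The set `GENERIC_KEYS` contains only the example names listed above, and the membership rule is an assumption; the actual classification behind the reported 34% is not described here.

```python
# Example generic slot names taken from the text; the real criterion is unknown.
GENERIC_KEYS = {"current_page", "search_term", "search_result",
                "recipe_title", "recipe_page"}


def generic_fraction(memories):
    """Share of Memory keys, pooled over rollouts, that are generic slots."""
    keys = [key for memory in memories for key in memory]
    if not keys:
        return 0.0
    return sum(key in GENERIC_KEYS for key in keys) / len(keys)


# Two made-up final Memory objects, used only to exercise the function.
memories = [
    {"task": "...", "current_page": "...", "search_term": "..."},
    {"task": "...", "candidate_recipe": "..."},
]
fraction = generic_fraction(memories)   # 2 generic keys out of 5
```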


GRPO with $1/k$ (no length norm). Replacing the normalizer with a constant restores full weight on the long failures, and the behavior moves the other way: the agent commits to a small set of task-anchored keys early and holds them (generic-slot keys drop to 11%). The resulting Memory is compact and semantically grounded — roughly 3× shorter than the $1/|\tau_i|$ schema at matched task success.


Generic-slot bloat under $1/|\tau_i|$  ·  task-anchored schema under $1/k$

Figure: Memory JSON keys per agent step, split by outcome and difficulty (panels: Easy / Success, Easy / Failure, Medium / Success, Medium / Failure; curves: Base, GRPO (length norm), GRPO (no length norm)). $1/|\tau_i|$ tracks the one-new-key-per-step diagonal; the constant $1/k$ fix stays close to Base.
The fuller ablation (RAFT++, prompt and horizon sweeps) is in the paper.

The curve quantifies the same shift in behavior: under $1/|\tau_i|$ the per-step Memory key count climbs along the one-new-key-per-step diagonal — steepest on the failure rollouts the loss under-penalizes — while the $1/k$ run stays close to Base. Test reward is essentially tied between the two; the difference is purely in how many tokens the agent spends getting there.

Training dynamics

In aggregate the takeaway is simple: removing length normalization yields drastically shorter responses at essentially the same task performance. Test reward is statistically tied between the two losses (first panel), yet the constant $1/k$ run uses markedly fewer steps per trajectory and fewer tokens per step, while $1/|\tau_i|$ keeps inflating both. (Per-token entropy falls under $1/|\tau_i|$ because the extra tokens are low-entropy Memory boilerplate.)
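The entropy remark can be illustrated with a toy calculation: averaging in many near-deterministic tokens lowers the mean per-token entropy even if the decision tokens are unchanged. The distributions below are invented for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(token_dists):
    return sum(entropy(d) for d in token_dists) / len(token_dists)


# Tokens where the policy makes a real choice.
decision_tokens = [[0.5, 0.5], [0.4, 0.3, 0.3]]
# Near-deterministic tokens, standing in for boilerplate Memory text.
boilerplate_tokens = [[0.99, 0.01]] * 6

compact = mean_entropy(decision_tokens)
padded = mean_entropy(decision_tokens + boilerplate_tokens)
```

The padded response has a much lower mean entropy than the compact one, although the policy is no more certain about the tokens that matter.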

Figure: Effect of $1/|\tau_i|$ on GRPO training dynamics (panels: test reward, steps / traj, per-token entropy, tokens / step; curves: GRPO (length norm, $1/|\tau_i|$), GRPO (no length norm, constant $1/k$)). Test reward is essentially tied between the two losses, but $1/|\tau_i|$ produces longer trajectories, longer per-step responses, and lower per-token entropy.

Performance Analysis

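Peak test success rate (%) on the WebGym OOD test split. AsyncWebRL (full) is best in every column.

| Method | Easy | Medium | Hard | Avg |
|---|---|---|---|---|
| Base (no RL) | 32.5 | 11.2 | 0.0 | 26.2 |
| WebGym (sync REINFORCE) | 50.9 | 24.1 | 4.8 | 42.9 |
| AsyncWebRL-RAFT++ | 46.6 | 27.8 | 5.5 | 39.3 |
| AsyncWebRL (full) | 52.4 | 34.3 | 7.1 | 45.4 |

| Method | Easy | Medium | Hard | Avg |
|---|---|---|---|---|
| Base (no RL) | 37.4 | 24.3 | 1.2 | 32.0 |
| AsyncWebRL-RAFT++ | 47.3 | 30.0 | 5.2 | 40.5 |
| AsyncWebRL (full) | 51.8 | 35.1 | 11.3 | 44.4 |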
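The relative gains quoted in the abstract can be reproduced from the rows of the first table that include the WebGym baseline. The helper `relative_gain` is introduced here only for this check.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Peak test success rates (%) copied from the first table.
webgym = {"Easy": 50.9, "Medium": 24.1, "Hard": 4.8, "Avg": 42.9}
full = {"Easy": 52.4, "Medium": 34.3, "Hard": 7.1, "Avg": 45.4}

gains = {split: relative_gain(full[split], webgym[split]) for split in webgym}
```

The results are roughly +5.8% on the average, +42% on Medium, and +48% on Hard, matching the abstract; the Easy slice improves by about 3%.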

Training Curves


Test success rate vs. training trajectories, per difficulty slice. Curves: AsyncWebRL (full), AsyncWebRL-RAFT++, and WebGym (dashed gray, Instruct only).

Ablations

Figure: RAFT++ learning-rate sweep on Qwen3-VL-8B-Instruct (panels: test SR (%), training reward, policy entropy; curves: lr=5e-6, lr=1e-5). Sweeping LR confirms the gap to GRPO is not a tuning artifact.
Figure: Why GRPO outperforms RAFT++ (panels: training reward, policy entropy, steps / traj; curves: GRPO (length norm), RAFT++). RAFT++ is behavior cloning on a slowly changing positive buffer, so it cannot push down the dominant failure modes on harder slices.
Figure: Batch size 128 vs. 32 for GRPO Instruct, against wall-clock hours (panels: test reward, training reward, policy entropy, steps / traj, tokens / step; curves: batch=128, batch=32).

BibTeX

@article{bai2026asyncwebrl,
  title     = {AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents},
  author    = {Bai, Hao and Yang, Rui and Ye, Chenlu and Whitehead, Spencer and Kumar, Aviral and Zhang, Tong},
  journal   = {arXiv preprint arXiv:2606.05597},
  year      = {2026}
}