FlashSAC: Fast and Stable
Off-Policy Reinforcement Learning
for High-Dimensional Robot Control

FlashSAC is a fast and stable off-policy RL algorithm that reaches the highest asymptotic performance in the shortest wall-clock time among existing methods for high-dimensional sim-to-real robotic control.

1Holiday Robotics 2KAIST 3KRAFTON 4Turing Inc 5TU Darmstadt 6hessian.AI 7KTH Royal Institute of Technology 8German Research Center for AI (DFKI) * equal contribution

TL;DR

If you're using PPO, try FlashSAC!

Video Results

Low DoF

State-based Low DoF Learning Curve

High DoF

State-based High DoF Learning Curve

Sim-to-Real (Flat)


Sim-to-Real (Rough)

G1 Stair

Motivation

Scaling RL to high-dimensional robots remains challenging.

The instability of off-policy methods primarily stems from critic training. Critics are trained by minimizing a bootstrapped Bellman error:

$$\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}} \left[ \left( Q_\theta(s, a) - (r + \gamma Q_{\bar{\theta}}(s', a')) \right)^2 \right], \quad a' \sim \pi(\cdot \mid s')$$

In high-dimensional settings, target values at next state-action pairs \((s', a')\) are often poorly covered by the replay buffer, and since targets depend on the critic's own predictions, small errors compound through repeated bootstrapping. FlashSAC addresses this by scaling data diversity, constraining update dynamics, and scaling model size once stability is ensured.
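The bootstrapped critic loss above can be sketched in a few lines. This is a minimal NumPy sketch, not FlashSAC's implementation; the `done` mask for episode boundaries is an assumption, and `q_target_next` stands in for the target network's prediction at \((s', a')\):

```python
import numpy as np

def critic_loss(q_pred, reward, q_target_next, gamma=0.99, done=None):
    """Mean squared bootstrapped Bellman error.

    q_pred:        Q_theta(s, a) for the sampled batch
    q_target_next: Q_thetabar(s', a') from the target network,
                   with a' drawn from the current policy
    done:          optional episode-termination mask (assumption)
    """
    if done is None:
        done = np.zeros_like(reward)
    # Bootstrapped target: r + gamma * Q_target(s', a'), cut at episode ends.
    target = reward + gamma * (1.0 - done) * q_target_next
    return np.mean((q_pred - target) ** 2)
```

Because `target` is built from the critic's own (target-network) predictions, any error at poorly covered \((s', a')\) pairs feeds back into the next regression target, which is the compounding effect described above.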

Algorithm

1. Scaling Data Volume and Diversity

2. Constraining Update Dynamics

FlashSAC explicitly bounds weight, feature, and gradient norms to prevent error amplification under bootstrapped critic updates:

Architecture Design

  • Inverted Residual Blocks: Inspired by Transformer feed-forward layers, expand features to a higher dimension, then project back with residual connections
  • Pre-activation Normalization: Applies batch normalization before nonlinearities to handle non-stationary replay data
  • Post-RMS Normalization: Bounds per-sample feature norms to prevent rare inputs from destabilizing bootstrapping
FlashSAC Architecture
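Two of the pieces above can be sketched as a forward pass. This is an illustrative NumPy sketch under stated assumptions: the ReLU nonlinearity, all shapes, and the omission of batch-norm statistics and learned scale parameters are simplifications, not FlashSAC's exact architecture:

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    """Post-RMS normalization: bound each sample's feature norm so a
    rare input cannot inject an outsized activation into bootstrapping."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

def inverted_residual_block(x, w_up, w_down):
    """Inverted residual block (forward pass only): expand features to a
    higher dimension, apply a nonlinearity, project back, add a residual."""
    h = np.maximum(x @ w_up, 0.0)   # expand + ReLU (nonlinearity assumed)
    return x + h @ w_down           # project back + residual connection
```

The residual path keeps the block near-identity at initialization, while the per-sample RMS bound caps how much any single replay sample can perturb the critic's features.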

Training Techniques

  • Cross-Batch Value Prediction: Concatenates current and next transitions into a single batch for consistent normalization statistics
  • Distributional Critic: Represents Q-values as categorical distributions over atoms, smoothing the optimization landscape
  • Adaptive Reward Scaling: Normalizes rewards to keep returns within fixed support while preserving optimal policy
  • Weight Normalization: Projects weight vectors onto unit-norm sphere after each update to prevent uncontrolled weight growth
Component Ablation Study
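The weight-normalization step in the list above amounts to a projection after each optimizer update. A minimal sketch, assuming one row per neuron's incoming weights (the layout is an assumption):

```python
import numpy as np

def project_unit_norm(w, eps=1e-8):
    """Project each weight row onto the unit-norm sphere after an
    optimizer step, preventing uncontrolled weight growth."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, eps)
```

Applied after every update, this keeps the effective learning rate governed by the optimizer rather than by drifting weight magnitudes.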

3. Scaling Model for Faster Training

With stable off-policy learning, scaling laws from supervised learning apply: larger models trained with larger batches and fewer updates converge faster. FlashSAC uses a 2.5M-parameter, 6-layer network with batch size 2048 and an update-to-data (UTD) ratio of 2/1024.

Hyperparameter Ablation Study

Citation

@article{kim2026flashsac,
  title={FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control},
  author={Kim, Donghu and Lee, Youngdo and Park, Minho and Kim, Kinam and Seno, Takuma and
          Nahrendra, I Made Aswin and Min, Sehee and Palenicek, Daniel and Vogt, Florian and
          Kragic, Danica and Peters, Jan and Choo, Jaegul and Lee, Hojoon},
  journal={arXiv preprint arXiv:2602},
  year={2026}
}