FlashSAC: Fast and Stable
Off-Policy Reinforcement Learning
for High-Dimensional Robot Control

FlashSAC is a fast and stable off-policy RL algorithm that reaches the highest asymptotic performance in the shortest wall-clock time among existing methods for high-dimensional sim-to-real robotic control.

1Holiday Robotics 2KAIST 3KRAFTON 4Turing Inc 5TU Darmstadt 6hessian.AI 7KTH Royal Institute of Technology 8German Research Center for AI (DFKI) * equal contribution

TL;DR

If you're using PPO, try FlashSAC!

Video Results

Low DoF

State-based Low DoF Learning Curve

High DoF

State-based High DoF Learning Curve

Sim-to-Real (Flat)


Sim-to-Real (Rough)

G1 Stair

Motivation

Scaling RL to high-dimensional robots remains challenging.

The instability of off-policy methods primarily stems from critic training. Critics are trained by minimizing a bootstrapped Bellman error:

$$\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}} \left[ \left( Q_\theta(s, a) - (r + \gamma Q_{\bar{\theta}}(s', a')) \right)^2 \right], \quad a' \sim \pi(\cdot \mid s')$$

In high-dimensional settings, target values at next state-action pairs \((s', a')\) are often poorly covered by the replay buffer, and since targets depend on the critic's own predictions, small errors compound through repeated bootstrapping. FlashSAC addresses this by scaling data diversity, constraining update dynamics, and scaling model size once stability is ensured.
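The bootstrapped critic loss above can be sketched in a few lines. This is a minimal NumPy sketch, not FlashSAC's implementation; the `done` mask for episode boundaries is an assumption, and `q_target_next` stands in for the target network's prediction at \((s', a')\):

```python
import numpy as np

def critic_loss(q_pred, reward, q_target_next, gamma=0.99, done=None):
    """Mean squared bootstrapped Bellman error.

    q_pred:        Q_theta(s, a) for the sampled batch
    q_target_next: Q_thetabar(s', a') from the target network,
                   with a' drawn from the current policy
    done:          optional episode-termination mask (assumption)
    """
    if done is None:
        done = np.zeros_like(reward)
    # Bootstrapped target: r + gamma * Q_target(s', a'), cut at episode ends.
    target = reward + gamma * (1.0 - done) * q_target_next
    return np.mean((q_pred - target) ** 2)
```

Because `target` is built from the critic's own (target-network) predictions, any error at poorly covered \((s', a')\) pairs feeds back into the next regression target, which is the compounding effect described above.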

Algorithm

1. Scaling Data Volume and Diversity

2. Constraining Update Dynamics

FlashSAC explicitly bounds weight, feature, and gradient norms to prevent error amplification under bootstrapped critic updates:

Architecture Design

  • Inverted Residual Blocks: Inspired by Transformer feed-forward layers, expand features to a higher dimension, then project back with residual connections
  • Pre-activation Normalization: Applies batch normalization before nonlinearities to handle non-stationary replay data
  • Post-RMS Normalization: Bounds per-sample feature norms to prevent rare inputs from destabilizing bootstrapping
FlashSAC Architecture
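Two of the pieces above can be sketched as a forward pass. This is an illustrative NumPy sketch under stated assumptions: the ReLU nonlinearity, all shapes, and the omission of batch-norm statistics and learned scale parameters are simplifications, not FlashSAC's exact architecture:

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    """Post-RMS normalization: bound each sample's feature norm so a
    rare input cannot inject an outsized activation into bootstrapping."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

def inverted_residual_block(x, w_up, w_down):
    """Inverted residual block (forward pass only): expand features to a
    higher dimension, apply a nonlinearity, project back, add a residual."""
    h = np.maximum(x @ w_up, 0.0)   # expand + ReLU (nonlinearity assumed)
    return x + h @ w_down           # project back + residual connection
```

The residual path keeps the block near-identity at initialization, while the per-sample RMS bound caps how much any single replay sample can perturb the critic's features.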

Training Techniques

  • Cross-Batch Value Prediction: Concatenates current and next transitions into a single batch for consistent normalization statistics
  • Distributional Critic: Represents Q-values as categorical distributions over atoms, smoothing the optimization landscape
  • Adaptive Reward Scaling: Normalizes rewards to keep returns within fixed support while preserving optimal policy
  • Weight Normalization: Projects weight vectors onto unit-norm sphere after each update to prevent uncontrolled weight growth
Component Ablation Study
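The weight-normalization step in the list above amounts to a projection after each optimizer update. A minimal sketch, assuming one row per neuron's incoming weights (the layout is an assumption):

```python
import numpy as np

def project_unit_norm(w, eps=1e-8):
    """Project each weight row onto the unit-norm sphere after an
    optimizer step, preventing uncontrolled weight growth."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, eps)
```

Applied after every update, this keeps the effective learning rate governed by the optimizer rather than by drifting weight magnitudes.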

3. Scaling Model for Faster Training

With stable off-policy learning, scaling laws from supervised learning apply: larger models trained with larger batches and fewer updates converge faster. FlashSAC uses a 2.5M-parameter, 6-layer network with batch size 2048 and an update-to-data (UTD) ratio of 2/1024.

Hyperparameter Ablation Study

Citation

@article{kim2026flashsac,
  title={FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control},
  author={Kim, Donghu and Lee, Youngdo and Park, Minho and Kim, Kinam and Seno, Takuma and
          Nahrendra, I Made Aswin and Min, Sehee and Palenicek, Daniel and Vogt, Florian and
          Kragic, Danica and Peters, Jan and Choo, Jaegul and Lee, Hojoon},
  journal={arXiv preprint arXiv:2602},
  year={2026}
}