FlashSAC is a fast and stable off-policy RL algorithm that achieves the highest asymptotic performance in the shortest wall-clock time among existing methods for high-dimensional sim-to-real robotic control.
If you're using PPO, try FlashSAC!
Scaling RL to high-dimensional robots remains challenging.
The instability of off-policy methods primarily stems from critic training. Critics are trained by minimizing a bootstrapped Bellman error:
$$\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}} \left[ \left( Q_\theta(s, a) - (r + \gamma Q_{\bar{\theta}}(s', a')) \right)^2 \right]$$
In high-dimensional settings, target values at next state-action pairs \((s', a')\), where \(a'\) is drawn from the current policy, are often poorly covered by the replay buffer. Since targets depend on the critic's own predictions, small errors compound through repeated bootstrapping. FlashSAC addresses this by scaling data diversity, constraining update dynamics, and scaling model size once stability is ensured.
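The bootstrapped Bellman loss above can be sketched in a few lines. This is a minimal numpy illustration of the equation, not FlashSAC's actual implementation; the function and batch-field names are assumptions.

```python
import numpy as np

def critic_loss(q, q_target, batch, gamma=0.99):
    """Mean squared bootstrapped Bellman error over a replay batch.

    q, q_target: callables mapping (states, actions) -> Q-value arrays,
        standing in for Q_theta and the target network Q_theta-bar.
    batch: dict with arrays 's', 'a', 'r', 's2', 'a2', where 'a2' is the
        next action sampled from the current policy at 's2'.
    """
    # Bootstrapped target: r + gamma * Q_target(s', a'); no gradient
    # flows through it since q_target is a frozen copy.
    target = batch['r'] + gamma * q_target(batch['s2'], batch['a2'])
    td_error = q(batch['s'], batch['a']) - target
    return np.mean(td_error ** 2)
```

Because `target` itself comes from a (slowly updated) copy of the critic, any systematic error in `q` feeds back into its own regression targets, which is exactly the amplification mechanism the text describes.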
FlashSAC explicitly bounds weight, feature, and gradient norms to prevent error amplification under bootstrapped critic updates.
With stable off-policy learning, scaling laws from supervised learning apply: larger models trained with larger batches and fewer updates converge faster. FlashSAC uses a 2.5M-parameter, 6-layer network with batch size 2048 and a UTD (update-to-data) ratio of 2/1024, i.e. 2 gradient updates per 1024 environment steps.
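For concreteness, these hyperparameters can be collected in a small config; the names below are assumptions for illustration, and the helper just makes the UTD arithmetic explicit:

```python
# Hypothetical config mirroring the numbers in the text (names are assumptions).
CONFIG = dict(
    hidden_layers=6,        # 6-layer critic/actor network
    params=2_500_000,       # ~2.5M parameters
    batch_size=2048,        # large-batch updates
    utd_updates=2,          # gradient updates ...
    utd_env_steps=1024,     # ... per this many environment steps
)

def gradient_updates(total_env_steps, cfg=CONFIG):
    """Total gradient updates implied by the UTD ratio: at 2/1024,
    data collection dominates and updates are comparatively rare."""
    return total_env_steps * cfg['utd_updates'] // cfg['utd_env_steps']
```

A low UTD ratio like 2/1024 means each gradient step sees a large, diverse batch of fresh data, which is the "larger batches, fewer updates" regime the text describes.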
@article{kim2026flashsac,
title={FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control},
author={Kim, Donghu and Lee, Youngdo and Park, Minho and Kim, Kinam and Seno, Takuma and
Nahrendra, I Made Aswin and Min, Sehee and Palenicek, Daniel and Vogt, Florian and
Kragic, Danica and Peters, Jan and Choo, Jaegul and Lee, Hojoon},
journal={arXiv preprint arXiv:2602},
year={2026}
}