
Recent research has successfully incorporated large vision-language models (VLMs) into low-level robotic control through supervised fine-tuning (SFT) on expert robotic datasets, giving rise to what are now referred to as vision-language-action (VLA) models. While these VLA models exhibit considerable capability, fine-tuning via behavior cloning (BC) imposes an upper bound on their generalization potential and scalability due to the cost of collecting expert demonstrations.

In response, recent efforts have attempted to integrate reinforcement learning (RL) into the VLA learning pipeline to address this limitation. By leveraging RL for fine-tuning or for curating demonstration data, researchers are increasingly able to refine these initially broad but suboptimal VLA models into agile robotic policies equipped with both high precision and robust generalization.

Iterative Reinforcement Learning

Naively running an online RL process with large neural networks often leads to severe instability and catastrophic forgetting of pre-trained knowledge. To mitigate this and effectively refine pre-trained VLA models, Guo et al. (ICRA 2025) iterate between stages of online RL and supervised learning (a minimal code sketch follows the list below):

  • Stage 1: RL with frozen VLM
  • Stage 2: Behavior Cloning using expert + collected dataset
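A minimal sketch of this two-stage loop, with the environment interaction and the two update rules passed in as hypothetical callables (names and signatures are illustrative assumptions, not the paper's code):

```python
# Sketch of the iterate-between-RL-and-BC scheme. The policy object, rollout
# collection, RL update, and BC update are all stand-ins supplied by the caller.
from typing import Any, Callable, List


def iterative_rl_bc(
    policy: Any,
    collect_rollouts: Callable[[Any], List[Any]],   # Stage 1: online interaction
    rl_update: Callable[[Any, List[Any]], None],     # e.g. a policy-gradient update
    bc_update: Callable[[Any, List[Any]], None],     # supervised (behavior-cloning) update
    expert_data: List[Any],
    num_iterations: int = 10,
) -> Any:
    online_data: List[Any] = []

    for _ in range(num_iterations):
        # Stage 1: online RL with the VLM backbone frozen (only the action head is trained).
        policy.freeze_vlm()
        rollouts = collect_rollouts(policy)
        rl_update(policy, rollouts)
        # Keep successful rollouts to grow the supervised dataset.
        online_data.extend(r for r in rollouts if getattr(r, "success", False))

        # Stage 2: behavior cloning on expert + online-collected data,
        # which counteracts instability and forgetting of pre-trained knowledge.
        policy.unfreeze_all()
        bc_update(policy, expert_data + online_data)

    return policy
```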

This approach is applied across a hybrid setting that mixes simulated RL environments (Metaworld and Franka Kitchen) with real-world Panda manipulation. For real-world tasks lacking a ground-truth reward function, one can train a reward classifier such as VICE, or alternatively define a handcrafted reward when the robot's internal state provides a reliable measure of success.
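As a rough illustration, a learned reward can be as simple as a binary success classifier whose predicted probability serves as the reward signal; the sketch below is a much-simplified stand-in in the spirit of VICE (the architecture, names, and training step are illustrative assumptions, not the paper's implementation):

```python
# Simplified learned reward: a binary success classifier on images, whose
# predicted success probability is used as the reward for real-world tasks
# without a ground-truth reward function.
import torch
import torch.nn as nn


class SuccessClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.net(img).squeeze(-1)   # logit of P(success | image)


def classifier_reward(clf: SuccessClassifier, img: torch.Tensor) -> torch.Tensor:
    # One common choice: reward = predicted success probability.
    # (VICE itself embeds the classifier in an adversarial IRL objective.)
    with torch.no_grad():
        return torch.sigmoid(clf(img))


def train_step(clf, optimizer, imgs, labels):
    # Supervised update on labeled success / failure frames.
    loss = nn.functional.binary_cross_entropy_with_logits(clf(imgs), labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```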


$\mathbf{Fig\ 1.}$ (a) Pre-training on Expert Dataset (b) Iterative between Online RL and Supervised Learning on Both Expert and Online-collected Dataset (Guo et al. 2025)


This iterative process offers numerous advantages:

  • (a) Improved Performance on Both Original and RL Tasks
    • The agent continues to refine its capabilities in both seen expert tasks and RL tasks through online interaction.
  • (b) Improved Generalization in Unseen Tasks
    • As the agent tackles an increasing variety of tasks automatically, its capacity to generalize improves in tandem.
    • e.g., after mastering four types of window tasks in Metaworld, the agent effectively generalized to windows of unseen colors and shapes.


$\mathbf{Fig\ 2.}$ Real-world experiments with the Panda arm. Results for standard RL on the real-world tasks are not reported, since directly fine-tuning the entire VLA model exceeded the computational capacity of the authors' local machine. (Guo et al. 2025)


Reinforcement Learning Distilled Generalists (RLDG)

Recent advances in robotic foundation models have paved the way for generalist policies capable of adapting to diverse tasks. While these models show impressive flexibility, their performance heavily depends on the quality of their training data. Furthermore, previous approaches typically rely on fine-tuning VLA models with a limited number of demonstrations (e.g. OpenVLA is fully fine-tuned with 10 to 150 demonstrations). These demonstrations, often obtained via human teleoperation, are inherently unscalable and incur significant costs.

Reinforcement Learning Distilled Generalists (RLDG) leverages RL to generate high-quality training data for fine-tuning generalist policies, improving the performance and generalizability of pre-trained VLAs.

  1. RL Policy Training
    Train specialist RL policies on narrow tasks (e.g. $\pi_\texttt{ethernet}$, $\pi_\texttt{USB}$, $\pi_\texttt{VGA}$, ...).
  2. Experience Collection
    Collect rollouts on the downstream tasks with the trained RL specialists.
  3. Generalist Supervised Fine-Tuning
    Fine-tune the VLA model on this RL-generated data via supervised learning (see the sketch below).
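A minimal sketch of the three-step pipeline, with the specialist training, rollout collection, and supervised fine-tuning passed in as hypothetical callables (none of these names come from the RLDG codebase):

```python
# Sketch of the RLDG pipeline: (1) train a specialist RL policy per narrow task,
# (2) roll the specialists out to collect high-quality demonstrations,
# (3) fine-tune the generalist VLA on the resulting dataset.
from typing import Any, Callable, Dict, List


def rldg(
    tasks: Dict[str, Any],                                    # e.g. {"ethernet": env, "usb": env, ...}
    train_rl_specialist: Callable[[Any], Any],                # returns a trained per-task RL policy
    collect_rollouts: Callable[[Any, Any, int], List[Any]],   # (policy, env, n) -> successful episodes
    finetune_vla: Callable[[Any, List[Any]], Any],            # supervised fine-tuning of the generalist
    generalist_vla: Any,
    episodes_per_task: int = 50,
) -> Any:
    distill_data: List[Any] = []

    for name, env in tasks.items():
        # 1. RL policy training: one narrow specialist per task.
        specialist = train_rl_specialist(env)
        # 2. Experience collection: roll out the specialist to gather demonstrations.
        distill_data += collect_rollouts(specialist, env, episodes_per_task)

    # 3. Generalist supervised fine-tuning on the RL-generated dataset.
    return finetune_vla(generalist_vla, distill_data)
```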


$\mathbf{Fig\ 3.}$ RLDG improves generalist robot policies like OpenVLA and Octo by training specialist RL policies and using them to generate a high-quality fine-tuning dataset (Xu et al. 2025)


The authors evaluate the effectiveness of RLDG across four real-world manipulation tasks that present unique challenges. These include high-precision, contact-rich tasks that often hinder the performance of generalist models, and complex multi-stage assembly tasks.


$\mathbf{Fig\ 4.}$ RLDG Experimental Setups: Tasks and Robots (Xu et al. 2025)


For both OpenVLA and Octo, fine-tuning with RL-generated demonstrations consistently surpasses the performance achieved with human-collected data. Compared to human demonstrations, fine-tuning generalist policies with RLDG also shows superior sample efficiency on both in-distribution and unseen tasks: the VLA fine-tuned with RLDG reached a $100\%$ success rate with just $45$ RL episodes, whereas human demonstrations required $300$ episodes to attain equivalent performance.


$\mathbf{Fig\ 5.}$ Success rate comparison of OpenVLA and Octo policies fine-tuned with RLDG versus conventional methods using human demonstrations (Xu et al. 2025)


The authors argue that RL data can surpass human-collected data in particular because RL excels at fine-grained control, unlike human teleoperation. Since the tasks under study demand precise, nuanced manipulation, they are difficult even for a human teleoperator, so human-collected data is no longer near-optimal. Given an appropriately designed reward function, an RL-derived policy can outperform its human counterpart in such scenarios.
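As a toy illustration of such a reward, a dense distance-to-goal term plus a sparse success bonus is one common handcrafted choice for insertion-style tasks (the weights and interface below are made-up assumptions, not the authors' reward):

```python
import numpy as np


def insertion_reward(ee_pos: np.ndarray, goal_pos: np.ndarray, inserted: bool) -> float:
    """Toy dense + sparse reward for a precision insertion task (illustrative only)."""
    dist = float(np.linalg.norm(ee_pos - goal_pos))
    reward = -dist            # dense shaping: pull the end-effector toward the insertion point
    if inserted:              # sparse bonus on successful insertion
        reward += 10.0
    return reward
```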

In the following figure (left), the RL action distribution concentrates probability mass in the correct direction (bottom-left), which guides the end-effector towards the insertion point. In contrast, the human action distribution clusters near the center, showing only a slight bias in the correct direction. This highlights that the RL actions are closer to optimal than the human actions, which explains the better sample efficiency for fine-tuning shown in the right figure.


$\mathbf{Fig\ 6.}$ Action distributions of the RL policy versus human demonstrations near the insertion point (left), and the resulting sample efficiency of fine-tuning (right) (Xu et al. 2025)


References

[1] Guo et al. “Improving Vision-Language-Action Model with Online Reinforcement Learning”, ICRA 2025
[2] Xu et al. “RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning”, ICRA 2025

