Tuesday 25 February 2025
Robotics researchers have made a significant breakthrough in developing a unified framework for visual-language-action (VLA) models, which enable robots to learn complex tasks by combining vision, language, and action. The new approach, called DiVLA, has been shown to outperform existing VLA models on a range of robotics tasks, including object selection, sorting, bin picking, and table bussing.
Traditionally, VLA models have relied on separate components for vision, language, and action, which limits their flexibility and scalability. In contrast, DiVLA integrates these components into a single neural network architecture, allowing the model to learn from raw visual data, language instructions, and robot actions simultaneously.
The DiVLA framework consists of two main components: a diffusion-based policy for generating robot actions and a reasoning injection module for incorporating logical reasoning into the policy. The diffusion-based policy generates the next action by starting from random noise and iteratively denoising it, conditioned on the current state of the environment and the robot’s previous actions. This approach enables the model to learn complex sequences of actions, such as manipulating objects in a specific order.
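To make the denoising idea concrete, here is a minimal sketch of a diffusion-style action sampler in PyTorch. It is not the authors' implementation: the NoisePredictor network, the 512-dimensional observation embedding, the 7-dimensional action, and the noise schedule are all illustrative assumptions, and a real VLA policy would condition on features from a vision-language backbone with a much larger denoising network.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Small MLP that predicts the noise added to an action vector,
    conditioned on an observation embedding and a diffusion timestep."""
    def __init__(self, action_dim=7, obs_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, obs_emb, t):
        # timestep fed as a scaled scalar; real systems use sinusoidal embeddings
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([noisy_action, obs_emb, t_feat], dim=-1))


@torch.no_grad()
def sample_action(model, obs_emb, action_dim=7, num_steps=100):
    """DDPM-style reverse process: start from pure Gaussian noise and
    iteratively denoise it into an action, conditioned on the observation."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    batch = obs_emb.shape[0]
    action = torch.randn(batch, action_dim)           # noise at t = T
    for t in reversed(range(num_steps)):
        t_batch = torch.full((batch,), t)
        eps = model(action, obs_emb, t_batch)         # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (action - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(action) if t > 0 else torch.zeros_like(action)
        action = mean + torch.sqrt(betas[t]) * noise
    return action


# usage: a random vector stands in for the vision-language observation embedding
model = NoisePredictor()
obs_emb = torch.randn(1, 512)
print(sample_action(model, obs_emb).shape)            # torch.Size([1, 7])
```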
The reasoning injection module integrates logical reasoning into the policy by injecting reasoning phrases into the policy network during training. These phrases condition the policy on the intermediate rationale behind a decision, encouraging the model to account for the consequences of its actions rather than reacting only to what it currently sees.
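A common way to inject this kind of conditioning is FiLM-style feature modulation, sketched below: the reasoning phrase is embedded and mapped to a scale and shift that modulate the policy's hidden features. The ReasoningInjection module, the embedding sizes, and the example phrase are assumptions for illustration; the paper's exact injection mechanism may differ in detail.

```python
import torch
import torch.nn as nn

class ReasoningInjection(nn.Module):
    """FiLM-style conditioning: an embedding of the reasoning phrase is mapped
    to a per-feature scale and shift that modulate the policy's hidden state."""
    def __init__(self, reason_dim=512, feat_dim=256):
        super().__init__()
        self.to_scale_shift = nn.Linear(reason_dim, 2 * feat_dim)

    def forward(self, policy_feat, reason_emb):
        scale, shift = self.to_scale_shift(reason_emb).chunk(2, dim=-1)
        # reduces to the identity when scale and shift are zero
        return policy_feat * (1.0 + scale) + shift


# usage: random stand-ins for the policy features and the embedding of a
# hypothetical phrase like "move the red block aside before reaching the bin"
inject = ReasoningInjection()
policy_feat = torch.randn(1, 256)
reason_emb = torch.randn(1, 512)
print(inject(policy_feat, reason_emb).shape)   # torch.Size([1, 256])
```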
To evaluate the performance of DiVLA, researchers conducted experiments on a variety of robotics tasks using real-world robots. The results show that DiVLA outperforms existing VLA models in terms of success rate, efficiency, and generalization ability. For example, DiVLA achieved a 60% success rate on a factory sorting task, compared to 40% for the previous best model.
DiVLA also performs well under challenging conditions, such as shifted camera viewpoints and cluttered scenes. Its ability to generalize to new viewpoints and to handle clutter demonstrates the model’s robustness and flexibility in real-world applications.
The DiVLA framework has significant implications for robotics research and development. By enabling robots to learn complex tasks from raw visual data, language instructions, and robot actions, DiVLA paves the way for more advanced robotic systems that can interact with humans and adapt to new environments.
In addition to its technical contributions, DiVLA also highlights the importance of integrating logical reasoning into machine learning models.
Cite this article: “DiVLA: A Unified Framework for Visual-Language-Action Models in Robotics”, The Science Archive, 2025.
Robotics, Visual-Language-Action, DiVLA, Neural Networks, Object Selection, Sorting, Bin Picking, Table Bussing, Logical Reasoning, Machine Learning