TinyVLA

Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

Junjie Wen*,1,2 Yichen Zhu*,1 Jinming Li1,3 Minjie Zhu1,2 Kun Wu4,5 Zhiyuan Xu5 Ran Cheng1 Chaomin Shen2 Yaxin Peng3 Feifei Feng1 Jian Tang5
1. Midea Group 2. East China Normal University 3. Shanghai University 4. Syracuse University
5. Beijing Innovation Center of Humanoid Robotics
*Indicates equal contribution, listed in alphabetical order. This work was done during Junjie Wen's internship at Midea Group.

Abstract

Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning. However, current VLA models face significant challenges: they are slow at inference and require extensive pre-training on large amounts of robotic data, which makes real-world deployment difficult. In this paper, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for a pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA both in simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that TinyVLA offers an interesting perspective on utilizing pre-trained multimodal models for policy learning.
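To make the two-part recipe concrete, the sketch below pairs a stand-in multimodal backbone with a diffusion-style action head trained with a standard denoising objective. This is a minimal illustration under stated assumptions, not the released TinyVLA implementation: the module names (TinyBackbone, DiffusionActionHead), dimensions, action-chunk horizon, and noise schedule are placeholders chosen for brevity.

import torch
import torch.nn as nn

# Hypothetical stand-in for a compact pre-trained vision-language backbone.
class TinyBackbone(nn.Module):
    def __init__(self, embed_dim=256, vocab_size=1000):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, embed_dim))
        self.txt_embed = nn.EmbeddingBag(vocab_size, embed_dim)

    def forward(self, image, token_ids):
        # Fuse visual and language features into a single conditioning vector.
        return self.img_proj(image) + self.txt_embed(token_ids)

# Diffusion-style action decoder: predicts the noise added to an action chunk,
# conditioned on the backbone embedding and the diffusion timestep.
class DiffusionActionHead(nn.Module):
    def __init__(self, action_dim=7, horizon=8, embed_dim=256, steps=50):
        super().__init__()
        self.horizon, self.action_dim, self.steps = horizon, action_dim, steps
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + embed_dim + 1, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )
        betas = torch.linspace(1e-4, 0.02, steps)  # assumed linear noise schedule
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))

    def forward(self, noisy_actions, t, cond):
        t_feat = (t.float() / self.steps).unsqueeze(-1)
        x = torch.cat([noisy_actions.flatten(1), cond, t_feat], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

    def loss(self, actions, cond):
        # Standard DDPM-style objective: corrupt the demonstrated action chunk
        # with Gaussian noise and train the network to predict that noise.
        t = torch.randint(0, self.steps, (actions.shape[0],), device=actions.device)
        noise = torch.randn_like(actions)
        a = self.alphas_cumprod[t].view(-1, 1, 1)
        noisy = a.sqrt() * actions + (1.0 - a).sqrt() * noise
        return nn.functional.mse_loss(self(noisy, t, cond), noise)

# One fine-tuning step on demonstration data; no separate robot pre-training stage.
backbone, head = TinyBackbone(), DiffusionActionHead()
image = torch.randn(4, 3, 32, 32)            # toy camera observation
token_ids = torch.randint(0, 1000, (4, 12))  # toy tokenized language instruction
actions = torch.randn(4, 8, 7)               # demonstrated action chunk
loss = head.loss(actions, backbone(image, token_ids))
loss.backward()

At inference, the same head would start from Gaussian noise and iteratively denoise an action chunk conditioned on the backbone embedding; as described above, the diffusion decoder is attached during fine-tuning rather than learned in a separate robot pre-training stage.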

Task Settings


There are 5 tasks in total in the single-arm Franka setting.

Experiments Results


Quantitative results of the real-world experiments. We report the average success rate across multiple tasks and the number of trainable parameters for each model.


Quantitative results for the bimanual UR5 real-robot experiments. We report the average success rate over 10 trials. All models are trained in a multi-task setting.

Generalization Experiments of TinyVLA

1. Instruction generalization

Level: Understand an unseen color.
Instruction: Upright the tipped-over green mug.
1x In-domain
1x Out-of-domain
Level: Distinguish seen objects.
Instruction: Pick the pink cube.
Notice: Both objects are seen in different tasks.
Our policy did not overfit to the seen trajectories but instead picked up the cube and then released it.
1x Out-of-domain
Level: Understand unseen objects & a new function of seen objects.
Instruction: Pick the car and place into the box.
Notice: The toy car is an entirely unseen object.
Our policy can recognize the unseen toy car and place it into the box with human help.
1x Out-of-domain

2. Object & appearance generalization

Instruction: Upright the tipped-over mug.
1x In-domain
1x Out-of-domain
Instruction: Close the drawer.
1x In-domain
1x Out-of-domain
Instruction: Open the lid of the box.
1x In-domain
1x Out-of-domain
Instruction: Upright the tipped-over mug.
1x In-domain
1x Out-of-domain

3. Background generalization

Instruction: Place the tennis ball into the ball box.
1x Deep blue desk mat
1x Yellow desk mat
Instruction: Stack the pink cube on top of the blue cube.
1x Gray desk mat
1x Wooden desk board

4. Distractors in bimanual settings

Instruction: Transfer the bread and place it into the plate.
3x Original
3x With distractors
Instruction: Stack the cubes on top of the plate.
3x Original
3x With distractors
Instruction: Unzipping the bag and placing a tennis ball inside it.
3x Original
3x With distractors

5. View generalization

Instruction: Stack the pink cube on top of the blue cube.
1x LeftView-30° (vs. original view)
1x LeftView+15° (vs. original view)
1x RightView-15° (vs. original view)
1x RightView+15° (vs. original view)

BibTeX

@article{wen2024tinyvla,
      title={TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation},
      author={Wen, Junjie and Zhu, Yichen and Li, Jinming and Zhu, Minjie and Wu, Kun and Xu, Zhiyuan and Cheng, Ran and Shen, Chaomin and Peng, Yaxin and Feng, Feifei and others},
      journal={arXiv preprint arXiv:2409.12514},
      year={2024}
    }