Scaling up learning across many different robot types
Together with partners from 33 academic labs, we have pooled data from 22 different robot types to create the Open X-Embodiment dataset and RT-X model
Robots are great specialists, but poor generalists. Typically, you have to train a model for each task, robot, and environment. Changing a single variable often requires starting from scratch. But what if we could combine knowledge across robots and create a way to train a general-purpose robot?
Today, we are launching a new set of resources for general-purpose robotics learning across different robot types, or embodiments. Together with partners from 33 academic labs, we have pooled data from 22 different robot types to create the Open X-Embodiment dataset. We are also releasing RT-1-X, a robotics transformer (RT) model derived from RT-1 and trained on our dataset, which shows skill transfer across many robot embodiments.
In this work, we show that training a single model on data from multiple embodiments leads to significantly better performance across many robots than training on data from individual embodiments alone. We tested our RT-1-X model in five different research labs, demonstrating a 50% improvement in success rate on average across five commonly used robots, compared to methods developed independently and specifically for each robot. We also showed that training our vision-language-action model, RT-2, on data from multiple embodiments tripled its performance on real-world robotic skills.
We developed these tools to collectively advance cross-embodiment research in the robotics community. The Open X-Embodiment dataset and RT-1-X model checkpoint are now available to the broader research community, thanks to robotics labs around the world that shared data and helped evaluate our model in a shared commitment to developing this technology openly and responsibly. We believe these tools will transform the way robots are trained and accelerate this field of research.
Open X-Embodiment Dataset: Collecting data to train AI robots
Datasets, and the models trained on them, have played a critical role in advancing AI. Just as ImageNet propelled computer vision research, we believe Open X-Embodiment can do the same for robotics. Building a dataset of diverse robot demonstrations is a key step toward training a generalist model that can control many different types of robots, follow diverse instructions, perform basic reasoning about complex tasks, and generalize effectively. However, collecting such a dataset is too resource-intensive for any single lab.
To develop the Open X-Embodiment dataset, we partnered with academic research labs across more than 20 institutions to gather data from 22 robot embodiments, demonstrating more than 500 skills and 150,000 tasks across more than 1 million episodes. This dataset is the most comprehensive robotics dataset of its kind.
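The data is shared as robot learning episodes, and a minimal sketch of how one sub-dataset might be read with tensorflow_datasets is shown below. The builder directory and the observation/action keys are assumptions based on the RLDS episode convention rather than an exact published schema, which varies per embodiment; consult the dataset documentation for the actual paths and fields.

```python
# A minimal sketch of reading one Open X-Embodiment sub-dataset stored in an
# RLDS-style episode format with tensorflow_datasets. The builder directory
# and the observation/action keys are assumptions, not the exact schema.
import tensorflow_datasets as tfds

BUILDER_DIR = "gs://gresearch/robotics/fractal20220817_data/0.1.0"  # assumed path

builder = tfds.builder_from_directory(builder_dir=BUILDER_DIR)
dataset = builder.as_dataset(split="train")

# Each record is one episode; each episode holds a nested "steps" dataset of
# (observation, action, ...) entries.
for episode in dataset.take(1):
    for step in episode["steps"].take(3):
        observation = step["observation"]  # e.g. camera image, language instruction
        action = step["action"]            # embodiment-specific action fields
        print(list(observation.keys()), list(action.keys()))
```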
RT-X: A general-purpose robotics model
RT-X builds on two of our robotics transformer models. We trained RT-1-X based on RT-1, our model for real-world robotic control at scale, and RT-2-X based on RT-2, our vision-language-action (VLA) model that learns from both web and robotics data. We show that, given the same model architecture, RT-1-X and RT-2-X achieve greater performance thanks to the much more diverse, cross-embodiment data they are trained on. We also show that they improve on models trained in specific domains, and exhibit better generalization and new capabilities.
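To give a concrete sense of how a single transformer policy can be co-trained across different robots, the sketch below illustrates the general idea of a shared, discretized action representation: each continuous action dimension is mapped into a fixed number of bins so the model can predict actions as token sequences, regardless of which embodiment the data came from. The bin count and per-dimension bounds here are illustrative assumptions, not the exact values used in RT-1-X or RT-2-X.

```python
# Illustrative sketch of a shared discretized action space for cross-embodiment
# co-training. Bin count and action bounds are assumptions for illustration.
import numpy as np

NUM_BINS = 256  # assumed uniform discretization per action dimension

def discretize_action(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to one integer token per dimension."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)              # scale to [0, 1]
    return np.floor(normalized * (num_bins - 1e-6)).astype(np.int32)

def undiscretize_action(tokens, low, high, num_bins=NUM_BINS):
    """Map integer tokens back to the centers of their bins."""
    centers = (tokens.astype(np.float32) + 0.5) / num_bins
    return low + centers * (high - low)

# Example: a 7-D end-effector action (xyz translation, rotation, gripper) with
# per-dimension bounds chosen for one hypothetical embodiment.
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [1.0])
action = np.array([0.01, -0.02, 0.03, 0.0, 0.0, 0.0, 1.0])

tokens = discretize_action(action, low, high)
recovered = undiscretize_action(tokens, low, high)
print(tokens, recovered)
```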
To evaluate RT-1-X at partner universities, we compared how it performed against models developed independently for each lab's specific task, such as opening a door, on the corresponding dataset. RT-1-X trained on the Open X-Embodiment dataset outperformed the original models by 50% on average.
Emergent skills in RT-X
To investigate the transfer of knowledge across robots, we conducted experiments with our helper robot on tasks involving objects and skills that are not present in the RT-2 dataset but exist in another dataset for a different robot. In these emergent-skill evaluations, RT-2-X was three times as successful as our previous best model, RT-2.
Our results suggest that co-training with data from other platforms imbues RT-2-X with additional skills that were not present in the original dataset, enabling it to perform novel tasks.
RT-2-X demonstrates skills that the RT-2 model was not previously capable of, including better spatial understanding. For example, if we ask the robot to "move apple near cloth" instead of "move apple on cloth", the trajectories are quite different. By changing the preposition from "near" to "on", we can modulate the actions the robot takes.
RT-2-X shows that incorporating data from other robots into training improves the range of tasks that can be performed, even by a robot that already has large amounts of data available – but only when using a sufficiently high-capacity architecture.
Responsibly advancing robotics research
Robotics research is at an exciting, but early, juncture. New research shows the potential to develop more useful helper robots by scaling learning with more diverse data and better models. Working collaboratively with labs around the world and sharing resources is crucial to advancing robotics research in an open and responsible way. We hope that open sourcing the data and providing safe but limited models will reduce barriers and accelerate research. The future of robotics relies on enabling robots to learn from each other and, most importantly, allowing researchers to learn from one another.
This work demonstrates that models that generalize across embodiments are possible, with dramatic improvements in performance both on robots here at Google DeepMind and on robots at universities around the world. Future research could explore how to combine these advances with the self-improvement property of RoboCat, enabling models to improve from their own experience. Another future direction is to further probe how different dataset mixtures affect cross-embodiment generalization, and how this improved generalization materializes.
Partner with us: open-x-embodiment@googlegroups.com
Notes
We would like to thank the co-authors of this work: Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Huang, Christine Chan, Chuer Pan, Chuyuan Fu, Coline Devin, Danny Driess, Deepak Pathak, Dhruv Shah, Dieter Büchler, Dmitry Kalashnikov, Dorsa Sadigh, Edward Johns, Federico Ceola, Fei Xia, Freek Stulp, Gaoyue Zhou, Gaurav S. Sukhatme, Gautam Salhotra, Ge Yan, Giulio Schiavi, Hao Su, Hao-Shu Fang, Haochen Shi, Heni Ben Amor, Henrik I Christensen, Hiroki Furuta, Homer Walke, Hongjie Fang, Igor Mordatch, Ilija Radosavovic, Isabel Leal, Jacky Liang, Jaehyung Kim, Jan Schneider, Jasmine Hsu, Jeannette Bohg, Jeffrey Bingham, Jiajun Wu, Jialin Wu, Jianlan Luo, Jiayuan Gu, Jie Tan, Jihoon Oh, Jitendra Malik, Jonathan Tompson, Jonathan Yang, Joseph J. Lim, João Silvério, Junhyek Han, Kanishka Rao, Karl Pertsch, Karol Hausman, Keegan Go, Keerthana Gopalakrishnan, Ken Goldberg, Kendra Byrne, Kenneth Oslund, Kento Kawaharazuka, Kevin Zhang, Keyvan Majd, Krishan Rana, Krishnan Srinivasan, Lawrence Yunliang Chen, Lerrel Pinto, Liam Tan, Lionel Ott, Lisa Lee, Masayoshi Tomizuka, Maximilian Du, Michael Ahn, Mingtong Zhang, Mingyu Ding, Mohan Kumar Srirama, Mohit Sharma, Moo Jin Kim, Naoaki Kanazawa, Nicklas Hansen, Nicolas Heess, Nikhil J Joshi, Niko Suenderhauf, Norman Di Palo, Nur Muhammad Mahi Shafiullah, Oier Mees, Oliver Kroemer, Pannag R Sanketi, Paul Wohlhart, Peng Xu, Pierre Sermanet, Priya Sundaresan, Quan Vuong, Rafael Rafailov, Ran Tian, Ria Doshi, Roberto Martín-Martín, Russell Mendonca, Rutav Shah, Ryan Hoque, Ryan Julian, Samuel Bustamante, Sean Kirmani, Sergey Levine, Sherry Moore, Shikhar Bahl, Shivin Dass, Shuran Song, Sichun Xu, Siddhant Haldar, Simeon Adebola, Simon Guist, Soroush Nasiriany, Stefan Schaal, Stefan Welker, Stephen Tian, Sudeep Dasari, Suneel Belkhale, Takayuki Osa, Tatsuya Harada, Tatsuya Matsushima, Ted Xiao, Tianhe Yu, Tianli Ding, Todor Davchev, Tony Z. Zhao, Travis Armstrong, Trevor Darrell, Vidhi Jain, Vincent Vanhoucke, Wei Zhan, Wenxuan Zhou, Wolfram Burgard, Xi Chen, Xiaolong Wang, Xinghao Zhu, Xuanlin Li, Yao Lu, Yevgen Chebotar, Yifan Zhou, Yifeng Zhu, Ying Xu, Yixuan Wang, Yonatan Bisk, Yoonyoung Cho, Youngwoon Lee, Yuchen Cui, Yueh-hua Wu, Yujin Tang, Yuke Zhu, Yunzhu Li, Yusuke Iwasawa, Yutaka Matsuo, Zhuo Xu, Zichen Jeff Cui.
The authors would like to thank Arielle Bier, Dimple Vijaykumar, Gabriella Pearl, Jane Park, Katie McAtackney, Juanita Bawagan, Eleanor Tomlinson, Dex Hunter-Torricke for their help in creating the content for the blog. We would also like to thank John Guilyard for the amazing animations used for this website. We are thankful to Sanah Choudhry, Michael Griessel, Jon Small for their legal advice. We would like to acknowledge Yuheng Kuang, Ning Hou, Utsav Malla, Sarah Nguyen, Rochelle Dela Cruz, Justice Carbajal, Brianna Zitkovich, Emily Perez, Elio Prado, Jodilyn Peralta, Tran Pham, Deeksha Manjunath, Samuel Wan, Jaspiar Singh and the greater Google DeepMind team for their feedback and contributions.