AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning
TL;DR: AlphaStar is the first AI to reach the top league of a widely popular esport without any game restrictions. This January, a preliminary version of AlphaStar challenged two of the world's top players in StarCraft II, one of the most enduring and popular real-time strategy video games of all time. Since then, we have taken on a much greater challenge: playing the full game at a Grandmaster level under professionally approved conditions.
Our new research differs from prior work in several key regards:
- AlphaStar now has the same kind of constraints that humans play under – including viewing the world through a camera, and stronger limits on the frequency of its actions* (in collaboration with StarCraft professional Dario “TLO” Wünsch).
- AlphaStar can now play in one-on-one matches as and against Protoss, Terran, and Zerg – the three races present in StarCraft II. Each of the Protoss, Terran, and Zerg agents is a single neural network.
- The League training is fully automated, and starts only with agents trained by supervised learning, rather than from previously trained agents from past experiments.
- AlphaStar played on the official game server, Battle.net, using the same maps and conditions as human players. All game replays are available here.
We chose to use general-purpose machine learning techniques – including neural networks, self-play via reinforcement learning, multi-agent learning, and imitation learning – to learn directly from game data. Using the advances described in our Nature paper, AlphaStar was ranked above 99.8% of active players on Battle.net, and achieved a Grandmaster level for all three StarCraft II races: Protoss, Terran, and Zerg. We expect these methods could be applied to many other domains.
Learning-based systems and self-play are elegant research concepts which have facilitated remarkable advances in artificial intelligence. In 1992, researchers at IBM developed TD-Gammon, combining a learning-based system with a neural network to play the game of backgammon. Instead of playing according to hard-coded rules or heuristics, TD-Gammon was designed to use reinforcement learning to figure out, through trial-and-error, how to play the game in a way that maximises its probability of winning. Its developers used the notion of self-play to make the system more robust: by playing against versions of itself, the system grew increasingly proficient at the game. When combined, the notions of learning-based systems and self-play provide a powerful paradigm of open-ended learning.
Many advances since then have demonstrated that these approaches can be scaled to progressively challenging domains. For example, AlphaGo and AlphaZero established that it was possible for a system to learn to achieve superhuman performance at Go, chess, and shogi, and OpenAI Five and DeepMind’s FTW demonstrated the power of self-play in the modern games of Dota 2 and Quake III.
At DeepMind, we’re interested in understanding the potential – and limitations – of open-ended learning, which enables us to develop robust and flexible agents that can cope with complex, real-world domains. Games like StarCraft are an excellent training ground to advance these approaches, as players must use limited information to make dynamic and difficult decisions that have ramifications on multiple levels and timescales.
Despite its successes, self-play suffers from well-known drawbacks. The most salient one is forgetting: an agent playing against itself may keep improving, but it may also forget how to win against a previous version of itself. Forgetting can trap an agent in a cycle of “chasing its tail”, never converging or making real progress. For example, in the game rock-paper-scissors, an agent may currently prefer to play rock over the other options. As self-play progresses, a new agent will then switch to paper, as it wins against rock. Later, the agent will switch to scissors, and eventually back to rock, creating a cycle. Fictitious self-play – playing against a mixture of all previous strategies – is one solution to this challenge; a toy illustration of the cycle, and of how fictitious play escapes it, is sketched below.
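The following small, runnable toy (our illustration, not DeepMind's code; the payoff matrix, starting strategy, and step count are assumptions chosen purely for demonstration) makes the difference concrete: naive self-play best-responds only to its latest strategy and remains fully exploitable, while fictitious play best-responds to the average of all past strategies, so its averaged strategy approaches the unexploitable uniform mixture.

```python
import numpy as np

# Row-player payoffs: rows and columns are rock, paper, scissors.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def best_response(opponent_mix):
    """Pure strategy (one-hot) that maximises expected payoff against a mixture."""
    expected = PAYOFF @ opponent_mix
    choice = np.zeros(3)
    choice[np.argmax(expected)] = 1.0
    return choice

def exploitability(strategy):
    """Best payoff an adversary can achieve against this strategy (0 = unexploitable)."""
    return float(np.max(PAYOFF @ strategy))

def train(fictitious, steps=300):
    history = [np.array([1.0, 0.0, 0.0])]          # start by always playing rock
    for _ in range(steps):
        target = np.mean(history, axis=0) if fictitious else history[-1]
        history.append(best_response(target))
    # Naive self-play deploys its latest strategy; fictitious play deploys the
    # average of everything it has played so far.
    return np.mean(history, axis=0) if fictitious else history[-1]

for fictitious in (False, True):
    s = train(fictitious)
    print(f"fictitious={fictitious}:  strategy={np.round(s, 2)},"
          f"  exploitability={exploitability(s):.2f}")
```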
After first open-sourcing StarCraft II as a research environment, we found that even fictitious self-play techniques were insufficient to produce strong agents, so we set out to develop a better, general-purpose solution. A central idea of our recently published Nature paper extends the notion of fictitious self-play to a group of agents – the League. Normally in self-play, every agent maximises its probability of winning against its opponents; however, this was only part of the solution. In the real world, a player trying to improve at StarCraft may choose to do so by partnering with friends so that they can train particular strategies. As such, their training partners are not playing to win against every possible opponent, but are instead exposing the flaws of their friend, to help them become a better and more robust player. The key insight of the League is that playing to win is insufficient: instead, we need both main agents whose goal is to win versus everyone, and also exploiter agents that focus on helping the main agent grow stronger by exposing its flaws, rather than maximising their own win rate against all players. Using this training method, the League learns all its complex StarCraft II strategy in an end-to-end, fully automated fashion.
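The sketch below is a deliberately simplified, runnable picture of that structure (our illustration, not the paper's implementation): the Elo-style match model, the update rule, and the snapshot schedule are toy assumptions, but the roles mirror the League – main agents train against everyone, including frozen past snapshots, while exploiter agents train only against the current main agents.

```python
import random
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    role: str                    # "main" or "exploiter"
    skill: float = 0.0           # stand-in for neural-network weights

def play_match(a, b):
    """Toy match outcome: higher 'skill' wins more often (Elo-style)."""
    return random.random() < 1.0 / (1.0 + 10 ** ((b.skill - a.skill) / 4.0))

def train_step(agent, opponent):
    """Stand-in for one reinforcement learning update from a single game."""
    agent.skill += 0.1 if play_match(agent, opponent) else 0.02

def league_iteration(league, snapshots):
    mains = [a for a in league if a.role == "main"]
    for agent in league:
        if agent.role == "main":
            # Main agents face a mixture of current agents and frozen past
            # snapshots, so they must keep beating everyone.
            opponent_pool = league + snapshots
        else:
            # Exploiter agents only target the main agents, probing for
            # weaknesses rather than maximising their own overall win rate.
            opponent_pool = mains
        train_step(agent, random.choice(opponent_pool))
    # Periodically freeze copies of the main agents so that old strategies
    # are never forgotten by the League.
    snapshots.extend(Agent(a.name + "-snapshot", a.role, a.skill) for a in mains)

league = [Agent("main", "main"), Agent("exploiter", "exploiter")]
snapshots = []
for _ in range(20):
    league_iteration(league, snapshots)
print({a.name: round(a.skill, 2) for a in league}, "snapshots:", len(snapshots))
```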
Exploration is another key challenge in complex environments such as StarCraft. There are up to 10²⁶ possible actions available to one of our agents at each time step, and the agent must make thousands of actions before learning if it has won or lost the game. Finding winning strategies is challenging in such a massive solution space. Even with a strong self-play system and a diverse league of main and exploiter agents, there would be almost no chance of a system developing successful strategies in such a complex environment without some prior knowledge. Learning human strategies, and ensuring that the agents keep exploring those strategies throughout self-play, was key to unlocking AlphaStar’s performance. To do this, we used imitation learning – combined with advanced neural network architectures and techniques used for language modelling – to create an initial policy which played the game better than 84% of active players. We also used a latent variable which conditions the policy and encodes the distribution of opening moves from human games, which helped to preserve high-level strategies. AlphaStar then used a form of distillation throughout self-play to bias exploration towards human strategies. This approach enabled AlphaStar to represent many strategies within a single neural network (one for each race). During evaluation, the neural network was not conditioned on any specific opening moves.
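As a rough illustration of the distillation idea, the sketch below adds a KL penalty towards a fixed human (supervised) policy to a simple policy-gradient pseudo-loss. It is a minimal sketch under our own assumptions: the weighting, the direction of the KL term, and the function names are not taken from the paper, and the latent-variable conditioning on human opening moves is not shown.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def pseudo_loss(policy_probs, human_probs, action, advantage, distill_weight=0.1):
    """REINFORCE-style policy-gradient term plus a penalty that keeps the
    learned policy close to the human (supervised) policy."""
    policy_probs = np.asarray(policy_probs, dtype=float)
    pg_term = -advantage * np.log(policy_probs[action])
    distill_term = distill_weight * kl_divergence(human_probs, policy_probs)
    return pg_term + distill_term

# Example: three abstract actions; the agent took action 2 and its outcome
# gave an advantage estimate of +1.
print(pseudo_loss(policy_probs=[0.2, 0.3, 0.5],
                  human_probs=[0.4, 0.4, 0.2],
                  action=2, advantage=1.0))
```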
In addition, we found that many prior approaches to reinforcement learning are ineffective in StarCraft, due to its enormous action space. In particular, AlphaStar uses a new algorithm for off-policy reinforcement learning, which allows it to efficiently update its policy from games played by an older policy.
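The Nature paper describes AlphaStar's actual algorithm in detail; the following is only a minimal sketch of the general idea behind such off-policy corrections – clipped importance weights that reweight updates computed from games played by an older policy. The clipping threshold and function names here are our own assumptions.

```python
import numpy as np

def clipped_importance_weights(target_probs, behaviour_probs, rho_bar=1.0):
    """Ratio of new-policy to old-policy action probabilities, clipped at rho_bar."""
    return np.minimum(rho_bar, np.asarray(target_probs) / np.asarray(behaviour_probs))

def off_policy_policy_gradient(target_probs, behaviour_probs, advantages):
    """Policy-gradient estimate reweighted so that data generated by an
    older policy can still be used to update the current one."""
    rho = clipped_importance_weights(target_probs, behaviour_probs)
    return float(np.mean(-rho * np.asarray(advantages) * np.log(np.asarray(target_probs))))

# Probabilities the current and older policies assigned to the actions taken
# at three sampled game steps, together with their advantage estimates.
print(off_policy_policy_gradient(target_probs=[0.5, 0.2, 0.7],
                                 behaviour_probs=[0.4, 0.3, 0.6],
                                 advantages=[1.0, -0.5, 0.3]))
```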
Open-ended learning systems that utilise learning-based agents and self-play have achieved impressive results in increasingly challenging domains. Thanks to advances in imitation learning, reinforcement learning, and the League, we were able to train AlphaStar Final, an agent that reached Grandmaster level at the full game of StarCraft II without any modifications, as shown in the above video. This agent played online anonymously, using the gaming platform Battle.net, and achieved a Grandmaster level using all three StarCraft II races. AlphaStar played using a camera interface, with similar information to what human players would have, and with restrictions on its action rate to make it comparable with human players. The interface and restrictions were approved by a professional player. Ultimately, these results provide strong evidence that general-purpose learning techniques can scale AI systems to work in complex, dynamic environments involving multiple actors. The techniques we used to develop AlphaStar will help further the safety and robustness of AI systems in general, and, we hope, may serve to advance our research in real-world domains.
Read more about this work in Nature
Publicly available paper here
View all Battle.net game replays
AlphaStar team
Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, David Silver
Acknowledgements
We’re grateful to Dario Wünsch (TLO), Grzegorz Komincz (MaNa), and Diego Schwimer (Kelazhur) for their advice, guidance, and immense skill. We are also grateful for the continued support of Blizzard and the StarCraft gaming and AI community for making this work possible – especially those who played against AlphaStar on Battle.net. Thanks to Ali Razavi, Daniel Toyama, David Balduzzi, Doug Fritz, Eser Aygün, Florian Strub, Guillaume Alain, Haoran Tang, Jaume Sanchez, Jonathan Fildes, Julian Schrittwieser, Justin Novosad, Karen Simonyan, Karol Kurach, Philippe Hamel, Ricardo Barreira, Scott Reed, Sergey Bartunov, Shibl Mourad, Steve Gaffney, Thomas Hubert, the team that created PySC2, and the whole DeepMind team, with special thanks to the research platform, comms, and events teams.
*Agents were capped at a maximum of 22 agent actions per 5 seconds, where one agent action corresponds to a selection, an ability, and a target unit or point, which counts as up to 3 actions towards the in-game APM counter. Moving the camera also counts as an agent action, despite not being counted towards APM.