AlphaGo Zero Explained
AlphaGo Zero is a much better Go player than the version that beat the game’s world champion. What’s more impressive is that it’s entirely self-taught . As a curious Go player (2500+ elo), I reproduced the algorithms in my spare time. To play with my AlphaGo Zero clone or train your own AI, check out my Python/Tensorflow open-sourced code.
The original 19x19 AlphaGo Zero by DeepMind was trained with 64 GPUs and 19 CPU servers. Due to the computing resource constraints, I reduced the board size to 5x5. I was able to successfully train an agent that places reasonable moves on my MacBook Pro within 36 hours. This model is included in the source code.
In this article, we will focus on the key concepts in AlphaGo Zero. The intended audience should have a basic grasp on what is a convolutional network (ConvNet) and what we mean by training and inference, but don’t need to have reinforcement learning background. We will look into how the residual network (ResNet), self-play and Monte Carlo Tree Search (MCTS) in AlphaGo Zero form an elegant system together.
A Brief Overview of Go
Go is an abstract strategy board game for two players, in which the aim is to surround more territory than the opponent. The stones can be "captured" if they are fully surrounded by the opponents' stones. The game was invented in ancient China more than 2,500 years ago and is believed to be the oldest board game continuously played today. It was considered one of the four essential arts of the cultured aristocratic Chinese scholars in antiquity.
Despite its relatively simple rules, Go is very complex. Compared to chess, Go has both a larger board with more scope for play and longer games, and, on average, many more alternatives to consider per move. The lower bound on the number of legal moves in Go has been estimated to be 2x10^{170}.
AI experts thought it could take until the year 2100, if not longer, for people to create an AI system that can beat a human at Go, according to a 1997 New York Times article posted on Reddit's futurology thread by user Yull-Ban. DeepMind's AlphaGo Master made history when it beat Go champion Lee Sedol four times in a five-game tournament in 2016.
Picking The Next Move
The AlphaGo Zero system has two major components that enable its superhuman Go capabilities: the residual network (ResNet) and the Monte Carlo Tree Search (MCTS) simulations. This section introduces how these two components work and explains how a move is picked during a Go game.
The Residual Network (ResNet)
Residual Network is a special type of convolutional neural networks with skip connections between layers. The ResNet in AlphaGo Zero takes the current board state as input, and outputs a policy vector and a value scalar.
The input for the ResNet is the current Go board represented by a 19x19x17 matrix. The 17 layers are made up from 16 layers representing the current board state and one layer representing the current player. The current board state includes the past 7 boards and current board, with each board represented by two 19x19 layers indicating positions of black and white stones. The historical boards are needed because of Go’s Ko rules, AlphaGo Zero can’t observe all the game information from just the current board.
The policy output maps each valid move (including pass) to a number between 0 to 1, indicating how likely the ResNet will play this move.
The value output is a float number between -1 and 1, representing if the ResNet predicts the current player will win from this current board state. 1 means current player will definitely win and -1 means the current player will definitely lose.
The policy and value outputs share the initial convolutional block and the subsequent residual blocks. However, the last few layers are different. These different layers are referred as the policy head and the value head. The policy and value heads use different type of activation functions, as shown in the diagram above.
Monte Carlo Tree Search Simulations
AlphaGo Zero uses a variant of MCTS simulations which boosts the performance of the current ResNet policy. Each move is decided by the result of 1600 simulations, which take approximately 0.4 seconds to run in the AlphaGo Zero settings. Steps a, b, c in the diagram below demonstrates what happens in each simulation.
In the search tree, each node represents a board state. Board states are adjacent when board 1 can turn into board 2 with just one move. Adjacent board states are connected by edges that each corresponds to a unique move, including pass. Each edge keeps track of the information below.
- N - the visit count: how many times we have reached this edge in the past simulations.
- W - The total action value for all board positions we have seen after taking this move. A board state's action value is calculated by feeding the board state into the ResNet and calculating the value output.
- Q = W/N - The average action value for all visits through this edge.
- P - The prior probability of selecting this edge. This is the policy output from the ResNet.
Step 1: Select an edge
Start from the the current board state, intuitively, we want to select an edge that we know will generate the highest chance of winning. The chance of winning is summarized in Q: the average action value for all known states reached by taking this action. Therefore, we always want to select an edge with the highest Q value.
However, there is a caviat to this approach. It only encapsulates the action values for the boards that we have seen before. In the beginning of the simulations, most edges are unexplored. Maximizing Q alone would cause the algorithm to always pick the best known action, but not explore the unknown actions that are potentially better. Therefore, we introduce the exploration term U and select the edge with the maximum Q + U.
Looking at the formula for U, we see that it's a combination of the policy output and the relative visit count for this edge comparing to the alternative moves. Cpuct is a constant controlling how fast exploration converges to the best policy, where a higher Cpuct means more exploration. P(s, a) is the prior probability for the current edge, and a higher prior probability results in a higher change that this edge is selected. The last term in the formula represents the relative visit count for this edge so far. If the edge has not been visited or has been visited for very few times, the algorithm is more likely to select this edge.
Note that for edges that correspond to moves made by the opponent of the root node player, Q + U is minimized because the goal for the opponent is to make the action value for the root player lower.
Step 2: Expand and evaluate the selected edge
After selecting the edge, we expand it to the next board position after taking this move. Then, we evaluate this new board position by feeding it into the ResNet. The ResNet outputs a policy and value for this new board. We store the value in the selected edge and store the policy in the edges connected to this new board.
Step 3: Backup the action value to ancestor edges
After evaluting the value for this newly discovered move, we need to back up the value for all its previous moves by iteratively updating the edges above it. The updating rule is simple: N = N + 1, W = W + node’s action value, Q = W/N. Stop when reaching the root node.
Step 4: Pick a move according to the simulation results
After repeating the simulation steps 1-3 for 1600 times, select the move with the highest visiting count N. Because in each simulation, the algorithm selects the best move to its knowledge while appropriately explores new moves, the move that's taken the most after all the simulations tends to be the best move. A larger number of simulations tend to generate better policies.
Training
The last section talks about how to make a move with an existing ResNet. In this section, we will elaborate on how the ResNet is trained.
Step 1: Random Initalization
In the beginning, initialize the weights of the ResNet as random values. This initial policy is very bad. We will work on improving it in the later steps.
Step 2: Self Play
With the current version of the ResNet, the AI plays a game against itself using the ResNet + MCTS simulations technique explained in the last section. A temperature constant is used to control the amount of exploration while selecting the move from the policy. A higher temperature constant was used in the first 30 steps of the games to encourage more exploration in the beginning. This prevents the AI from playing similar moves in all the self-play games.
When the self-play game terminates as two consecutive passes occur, evaluate the final board and determine the winner. Use all the historical board in this self play as training inputs, and the actual moves in this game and the final outcome as training labels for policy and value. Update the ResNet with the training data generated from this game.
Data augmention is optional in this training process. The historical Go boards can be rotated and flipped, while the labels should stay the same. Experience replay was used in the clone for faster training, but not in the original paper.
After the ResNet is updated, repeat the process above and start another self-play game. Update the ResNet again with the outcome of this game. Repeat until the AI is good enough.
Relavant Concepts
Amplification and Distillation
We can interpret the technique used in AlphaGo Zero as a form of “bootstrapping” or "iterative distillation and amplification”. The ResNet starts with a weak policy. In each iteration, we create a stronger policy using MCTS in combination with the ResNet. This amplifies the performance of the ResNet.
However, this amplified AI runs 1600 simulations for each move, which is highly costly. In the distillation step, we let the weak policy (ResNet) chase this stronger policy by training on the data generated by this stronger policy. The weak policy becomes stronger in each step, and eventually becomes very good. You can read more about this concept in Paul Christiano’s post.
Exploration
In reinforcement learning, there is a well-known exploration-exploitation tradeoff. An agent needs to balance trying out new moves with unknown payoffs (explore) against exploiting the knowledge it already has from the known moves (exploit).
Two techniques are used to encourage AlphaGo Zero to explore. First, a higher temperature constant was used in generating the policy from MCTS simulations for the first 30 steps of the game (The clone uses the first 3 steps), so the AI is tries more new moves when starting out. This stops the AI from playing similar moves in all the games during self-play. In addition, AlphaGo Zero adds some Dirichlet noise to the policy from MCTS to encourage further exploration.
Model Choice in Self-Play
In the AlphaGo Zero paper, the best historical ResNet model is used to generate the self-play training data. Our clone uses the trick in the later Alpha Zero paper, which updates directly on the newest model. This approach was shown better and faster empirically.
Data Augmentation
Due to the symmetric nature of the game, each board state can be augmented 12 data points: the original board, the 2 reflected boards and the 3 rotated boards. 6 more board states can be generated swapping the players. This works directly for our clone because Komi rules are omitted. In the original AlphaGo Zero, re-evaluation would be needed when players are swapped.
Summary
This post aims to convey the key ideas from the AlphaGo Zero paper. We talked about how AlphaGo Zero determines how to make the next move, as well as how the ResNet is trained. There are implementation details that we purposely omitted, including: the combined loss function for policy and value, the momentum optimizer, the number of layers used for the ResNet, etc. Please refer to the paper for these details.