David Silver - Mastering the game of Go without human knowledge (2017)
Created: October 20, 2017 / Updated: November 2, 2024 / Status: finished / 3 min read (~485 words)
- How is AlphaGo Zero different from natural selection/evolution?
- The architecture from AlphaGo is simplified, merging the policy and value networks into a single network
- A deep neural network fθ with parameters θ
- The neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p,v)=fθ(s)
- The vector of move probabilities p represents the probability of selecting each move a, p_a = Pr(a|s)
- The value v is a scalar evaluation, estimating the probability of the current player winning from position s
- Combines the roles of the policy network and the value network in a single architecture
- The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (see the network sketch after this list)
- In each position s, an MCTS search is executed, guided by the neural network fθ
- The MCTS search outputs probabilities π of playing each move
- The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network's parameters are updated to make the move probabilities and value (p,v)=fθ(s) more closely match the improved search probabilities and self-play winner (π,z) (see the loss sketch after this list)
- Each edge (s,a) in the search tree stores a prior probability P(s,a), a visit count N(s,a), and an action value Q(s,a)
- Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound Q(s,a)+U(s,a), where U(s,a) ∝ P(s,a)/(1+N(s,a)), until a leaf node s′ is encountered
- MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = α_θ(s), proportional to the exponentiated visit count for each move, π_a ∝ N(s,a)^(1/τ), where τ is a temperature parameter (see the MCTS sketch after this list)
- First, the neural network is initialized to random weights θ_0. At each subsequent iteration i ≥ 1, games of self-play are generated (see the self-play sketch after this list)
- At each time-step t, an MCTS search π_t = α_{θ_{i−1}}(s_t) is executed using the previous iteration's neural network f_{θ_{i−1}}, and a move is played by sampling the search probabilities π_t
- A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length
- A pure reinforcement learning approach is fully feasible
- A pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data
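A minimal PyTorch sketch of the dual-headed network described in the notes above: a stack of residual blocks with batch normalization and ReLU, feeding a policy head (move probabilities p) and a value head (scalar v). The number of blocks, filter count, and input planes below are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Convolution -> batch norm -> ReLU, twice, with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)

class DualHeadNet(nn.Module):
    """f_theta(s) -> (p, v): log move probabilities and a scalar value in [-1, 1]."""
    def __init__(self, in_planes=17, channels=64, n_blocks=6, board_size=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
        n_moves = board_size * board_size + 1  # every board point plus pass
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(), nn.Linear(2 * board_size * board_size, n_moves),
        )
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
            nn.Flatten(), nn.Linear(board_size * board_size, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),
        )

    def forward(self, s):
        x = self.tower(self.stem(s))
        return F.log_softmax(self.policy_head(x), dim=1), self.value_head(x).squeeze(-1)
```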
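A sketch of the per-edge statistics and the selection rule Q(s,a) + U(s,a) noted above, together with the temperature-based search probabilities π_a ∝ N(s,a)^(1/τ). The c_puct constant and the square-root-of-parent-visits factor follow the usual PUCT formulation and are assumptions on my part, not quoted from the notes.

```python
import math

class Edge:
    """Statistics stored on each edge (s, a) of the search tree."""
    def __init__(self, prior):
        self.P = prior   # prior probability P(s, a) from the network
        self.N = 0       # visit count N(s, a)
        self.W = 0.0     # total action value accumulated over simulations

    @property
    def Q(self):
        """Mean action value Q(s, a)."""
        return self.W / self.N if self.N > 0 else 0.0

def select_move(edges, c_puct=1.5):
    """Pick the move maximizing Q(s, a) + U(s, a), with U ∝ P(s, a) / (1 + N(s, a))."""
    total_n = sum(e.N for e in edges.values())
    def score(edge):
        # the +1 under the square root avoids a degenerate all-zero U before any visits
        u = c_puct * edge.P * math.sqrt(total_n + 1) / (1 + edge.N)
        return edge.Q + u
    return max(edges, key=lambda a: score(edges[a]))

def search_probabilities(edges, tau=1.0):
    """π_a ∝ N(s, a)^(1/τ); small τ approaches greedy play."""
    counts = {a: e.N ** (1.0 / tau) for a, e in edges.items()}
    z = sum(counts.values()) or 1.0
    return {a: c / z for a, c in counts.items()}
```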
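Matching (p, v) to (π, z) corresponds, roughly, to minimizing a squared value error plus a policy cross-entropy. The sketch below assumes the network returns log move probabilities (as in the network sketch above) and leaves the weight-decay term to the optimizer.

```python
import torch

def alphazero_loss(log_p, v, pi, z):
    """Squared value error plus policy cross-entropy against the MCTS targets.

    log_p: (batch, moves) log move probabilities from the network
    v:     (batch,) predicted value
    pi:    (batch, moves) MCTS search probabilities (the training target)
    z:     (batch,) game outcome from the current player's perspective, in {-1, +1}
    The c * ||theta||^2 regularizer is typically handled as optimizer weight decay.
    """
    value_loss = torch.mean((z - v) ** 2)
    policy_loss = -torch.mean(torch.sum(pi * log_p, dim=1))
    return value_loss + policy_loss
```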
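A sketch of one self-play game as described above: at each step run MCTS with the latest network, sample a move from π_t, stop on resignation or maximum length, and assign the final winner z to every stored position. The game interface, the run_mcts helper, and the threshold values are hypothetical placeholders.

```python
import random

RESIGN_THRESHOLD = -0.9   # illustrative value, not the paper's exact setting
MAX_GAME_LENGTH = 722     # illustrative cap (2 * 19 * 19)

def self_play_game(net, game, run_mcts, tau=1.0):
    """Play one self-play game; return (state, pi, z) training triples."""
    history = []  # (state s_t, search probabilities pi_t, player to move as +1/-1)
    while not game.is_terminal() and len(history) < MAX_GAME_LENGTH:
        pi, root_value = run_mcts(net, game, tau)   # pi_t = alpha_theta(s_t)
        if root_value < RESIGN_THRESHOLD:
            winner = -game.to_play()                # current player resigns
            break
        history.append((game.state(), pi, game.to_play()))
        moves, probs = zip(*pi.items())
        game.play(random.choices(moves, weights=probs)[0])  # sample a move from pi_t
    else:
        winner = game.winner()                      # +1, -1, or 0 from a fixed perspective
    # z is the outcome seen from the player to move at each stored position
    return [(s, pi, winner * player) for s, pi, player in history]
```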
- Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354-359.