David Silver - Mastering the game of Go without human knowledge (2017)

Created: October 20, 2017 / Updated: November 2, 2024 / Status: finished / 3 min read (~485 words)
Machine learning

  • How is AlphaGo Zero different from natural selection/evolution?

  • The architecture from AlphaGo is simplified, merging the policy and value networks into a single network

  • A deep neural network fθ with parameters θ
  • The neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p,v)=fθ(s)
  • The vector of move probabilities p represents the probability of selecting each move a, p_a = Pr(a|s)
  • The value v is a scalar evaluation, estimating the probability of the current player winning from position s
  • Combines both the roles of policy network and value network into a single architecture
  • The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (a minimal network sketch follows this list)
  • In each position s, an MCTS search is executed, guided by the neural network fθ
  • The MCTS search outputs probabilities π of playing each move
  • The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network's parameters are updated to make the move probabilities and value (p,v)=fθ(s) more closely match the improved search probabilities and self-play winner (π,z) (see the training-loop sketch after this list)
  • Each edge (s,a) in the search tree stores a prior probability P(s,a), a visit count N(s,a), and an action value Q(s,a)
  • Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound Q(s,a) + U(s,a), where U(s,a) ∝ P(s,a)/(1+N(s,a)), until a leaf node s′ is encountered
  • MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π = αθ(s), proportional to the exponentiated visit count for each move, π_a ∝ N(s,a)^(1/τ), where τ is a temperature parameter (see the MCTS sketch after this list)
  • First, the neural network is initialized to random weights θ0. At each subsequent iteration i ≥ 1, games of self-play are generated
  • At each time-step t, an MCTS search π_t = α_{θ_{i−1}}(s_t) is executed using the previous iteration of the neural network f_{θ_{i−1}}, and a move is played by sampling the search probabilities π_t
  • A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length
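
To make the dual-headed architecture concrete, here is a minimal PyTorch sketch in the spirit of fθ: a convolutional stem, a stack of residual blocks with batch normalization and rectifier nonlinearities, and separate policy and value heads producing (p, v). The number of blocks, the channel counts and the 17-plane input encoding are illustrative assumptions, not the exact figures from the paper.

```python
# A sketch of the dual-headed residual network; all sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19        # Go board size
IN_PLANES = 17    # raw board representation: current stones plus history planes (assumed encoding)

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)  # skip connection followed by a rectifier

class PolicyValueNet(nn.Module):
    """f_theta(s) -> (p, v): move probabilities and a scalar value in [-1, 1]."""

    def __init__(self, channels=64, n_blocks=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
        # Policy head: one logit per board point plus one for passing.
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1))
        # Value head: a single scalar squashed into [-1, 1] by tanh.
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(), nn.Flatten(),
            nn.Linear(BOARD * BOARD, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

    def forward(self, s):
        x = self.blocks(self.stem(s))
        p = F.softmax(self.policy_head(x), dim=1)   # move probabilities p
        v = self.value_head(x).squeeze(-1)          # value estimate v
        return p, v
```

For a 19×19 board the policy head produces 362 probabilities (361 points plus pass), matching the move vector p described above.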
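
The search bookkeeping can be sketched just as compactly: each edge (s,a) keeps P, N and Q, simulations descend by maximizing Q(s,a) + U(s,a), and the root's visit counts yield the search probabilities π. The Node class, the c_puct constant and the √ΣN factor in U (the PUCT form used in the AlphaGo papers; the note above only records the P/(1+N) dependence) are assumptions of this sketch.

```python
# A sketch of the per-edge statistics and the two search rules quoted above.
import math

class Node:
    """One state in the search tree; children maps each move a to its edge statistics."""
    def __init__(self, prior):
        self.P = prior      # prior probability P(s, a) from the network
        self.N = 0          # visit count N(s, a)
        self.W = 0.0        # total action value accumulated by backups
        self.children = {}  # move -> Node

    @property
    def Q(self):
        """Mean action value Q(s, a)."""
        return self.W / self.N if self.N else 0.0

def select_child(node, c_puct=1.0):
    """Pick the move maximizing Q(s,a) + U(s,a), with U(s,a) ∝ P(s,a)/(1 + N(s,a))."""
    total_visits = max(1, sum(child.N for child in node.children.values()))
    def score(child):
        u = c_puct * child.P * math.sqrt(total_visits) / (1 + child.N)
        return child.Q + u
    return max(node.children.items(), key=lambda item: score(item[1]))

def search_probabilities(root, tau=1.0):
    """pi_a proportional to N(s, a)^(1/tau), tau being the temperature."""
    counts = {a: child.N ** (1.0 / tau) for a, child in root.children.items()}
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}
```

Expansion of leaf nodes with network priors and backup of the leaf evaluation along the visited path (incrementing N and W) are omitted here for brevity.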
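
Finally, a minimal sketch of the self-play policy-iteration loop described above: games are generated with MCTS guided by the current network, each position is stored with its search probabilities π_t and the eventual winner z, and θ is updated so that (p, v) = fθ(s) moves towards (π, z). The run_mcts and sample_move helpers, the Game interface and the sign convention for z are assumptions, not code from the paper.

```python
# A sketch of self-play data generation and the network update; the helpers
# marked as assumptions are hypothetical stand-ins, not code from the paper.
import torch
import torch.nn.functional as F

def self_play_game(net, game, n_simulations=100, tau=1.0):
    """Play one game with MCTS guided by `net`; return (state, pi, z) examples.
    `game` (encode/play/is_terminal/winner), run_mcts and sample_move are assumed helpers."""
    history = []  # (encoded state s_t, search probabilities pi_t)
    while not game.is_terminal():
        pi = run_mcts(net, game, n_simulations, tau)   # assumed: returns pi_t over legal moves
        history.append((game.encode(), pi))
        game.play(sample_move(pi))                     # assumed: samples a move from pi_t
    z = game.winner()  # +1 / -1 from the first player's perspective (assumed convention)
    # Store the outcome from the perspective of the player to move at each step.
    return [(s, pi, z if t % 2 == 0 else -z) for t, (s, pi) in enumerate(history)]

def train_step(net, optimizer, states, target_pi, target_z):
    """One gradient step moving (p, v) = f_theta(s) towards (pi, z):
    loss = (z - v)^2 - pi^T log p, with L2 regularization left to the optimizer."""
    p, v = net(states)
    value_loss = F.mse_loss(v, target_z)
    policy_loss = -(target_pi * torch.log(p + 1e-8)).sum(dim=1).mean()
    loss = value_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The weight decay term in the paper's loss is assumed to be handled by the optimizer, e.g. torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4).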

  • A pure reinforcement learning approach is fully feasible
  • A pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, compared to training on human expert data

  • Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354.