MuZero: Using a Basic Simulator to Improve Search

An exploration of this paper

Introduction

Planning algorithms based on lookahead search have achieved remarkable success in artificial intelligence. Human world champions have been defeated in classic games like checkers, chess, Go, and poker, and planning algorithms have had real-world impact in applications from logistics to chemical synthesis. However, these planning algorithms all rely on knowledge of the environment's dynamics, such as the rules of the game or an accurate simulator, preventing their direct application to real-world domains like robotics, industrial control, or intelligent assistants.

Model-based reinforcement learning [] aims to address this issue by first learning a model of the environment's dynamics, and then planning with respect to the learned model. Typically, these models have focused either on reconstructing the true environment state [] or on the sequence of full observations []. However, prior work [] remains far from the state of the art in visual domains such as Atari games []. Instead, the most successful methods are based on model-free RL [] - that is, they estimate the optimal policy or value function directly from interactions with the environment. However, model-free algorithms are in turn far from the state of the art in domains that require precise and sophisticated lookahead, such as chess or Go.

In this paper, we introduce MuZero, a new approach to model-based RL that achieves state-of-the-art performance on the Atari 2600, a visually complex set of domains, while maintaining superhuman performance in precision planning tasks such as chess, shogi, and Go. MuZero builds on AlphaZero's powerful search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure. MuZero also extends AlphaZero to a broader set of environments, including single-agent domains and non-zero rewards at intermediate time steps.

The main idea of the algorithm (summarized in Figure 1) is to predict those aspects of the future that are directly relevant for planning. The model receives the observation (e.g. an image of the Go board or the Atari screen) as an input and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At every one of these steps the model predicts the policy (the move to play), the value function (e.g. the predicted winner), and the immediate reward (e.g. the points scored by playing a move). The model is trained end-to-end with the sole objective of accurately estimating these three important quantities, so as to match the improved estimates of policy and value generated by search as well as the observed reward. There is no direct constraint or requirement for the hidden state to capture all the information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict; nor is there any requirement for the hidden state to match the true, unknown state of the environment; nor any other constraints on the semantics of state. Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to the most accurate planning.
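
To make these pieces concrete, here is a minimal sketch of the model in PyTorch. This is my own toy version, not the paper's networks: plain MLPs stand in for the residual convolutional towers mentioned later, all sizes are placeholders, and the two helper methods just bundle the calls the way the search uses them.

```python
# A minimal sketch of MuZero's three learned functions, with made-up sizes and
# plain MLPs standing in for the paper's residual convolutional networks.
# h: observation     -> hidden state s^0          (representation)
# g: (s^{k-1}, a^k)  -> (reward r^k, s^k)         (dynamics)
# f: s^k             -> (policy p^k, value v^k)   (prediction)
import torch
import torch.nn as nn

OBS_DIM, ACTION_DIM, HIDDEN_DIM = 128, 4, 64   # placeholder sizes, not the paper's


class MuZeroNet(nn.Module):
    def __init__(self):
        super().__init__()
        # h: observation -> hidden state s^0
        self.representation = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN_DIM), nn.ReLU())
        # g: (s^{k-1}, a^k) -> s^k, with a separate head for the reward r^k
        self.dynamics = nn.Sequential(nn.Linear(HIDDEN_DIM + ACTION_DIM, HIDDEN_DIM), nn.ReLU())
        self.reward_head = nn.Linear(HIDDEN_DIM, 1)
        # f: s^k -> policy p^k and value v^k
        self.policy_head = nn.Linear(HIDDEN_DIM, ACTION_DIM)
        self.value_head = nn.Linear(HIDDEN_DIM, 1)

    def initial_inference(self, observation):
        """Used once at the root of the search: h followed by f."""
        state = self.representation(observation)
        return state, self.policy_head(state), self.value_head(state)

    def recurrent_inference(self, state, action_onehot):
        """Used for every hypothetical step inside the search: g followed by f."""
        next_state = self.dynamics(torch.cat([state, action_onehot], dim=-1))
        reward = self.reward_head(next_state)
        return next_state, reward, self.policy_head(next_state), self.value_head(next_state)


# Unroll the model along a hypothetical action sequence, as the search would.
net = MuZeroNet()
state, policy_logits, value = net.initial_inference(torch.randn(1, OBS_DIM))
for a in [0, 2, 1]:                                   # an arbitrary plan
    onehot = nn.functional.one_hot(torch.tensor([a]), ACTION_DIM).float()
    state, reward, policy_logits, value = net.recurrent_inference(state, onehot)
```

The representation function is called once on a real observation; everything after that happens purely in hidden-state space.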

Prior work

A quite different approach to model-based RL has been developed, focusing on predicting the value function. The main idea of these methods is to construct an abstract MDP model such that planning in this MDP is equivalent to planning in the real environment. This equivalence is achieved by ensuring value equivalence, i.e. starting from the same real state, the cumulative reward of a trajectory through the abstract MDP matches the cumulative reward of a trajectory in the real environment.
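
To pin that down with the notation the paper uses later (my paraphrase, so read the indices loosely): let u_t be the real reward observed at time t, γ the discount, and r^k, v^k the model's predicted reward and value after k hypothetical steps from the real state at time t under actions a_{t+1}, ..., a_{t+k}. Value equivalence then asks, roughly, that

r^k ≈ u_{t+k} and v^k ≈ E[ u_{t+k+1} + γ u_{t+k+2} + γ^2 u_{t+k+3} + ... ]

i.e. the model only has to get rewards and values right, never the observations themselves.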

The Predictron [41] first introduced value equivalent models for predicting value, without actions. Although the underlying model still takes the form of an MDP, there is no requirement for its transition model to match the real states in the environment. Instead, the MDP model is viewed as a hidden layer of a deep neural network. The unrolled MDP is trained such that the expected cumulative sum of rewards matches the expected value with respect to the real environment, e.g. by temporal-difference learning.

As an example, imagine that we have a world we want to model, which has some states, but we don't know what those states are. So how do we form an implicit MDP model?

Given a previous hidden state s^{k-1} and a candidate action a^{k}, the _dynamics_ function _g_ produces an immediate reward r^k and a new hidden state s^{k}: r^k, s^k = g(s^{k-1}, a^k).

So you can think about this as a simulator which only gives you the rewards for actions - after all, that's the most important thing in RL.
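
Here is a toy illustration of that idea, with a randomly initialized function standing in for a trained dynamics model g (sizes and weights are arbitrary): you hand it a hidden state and any hypothetical action sequence, and all it ever gives back is predicted rewards and new hidden states, never a real observation.

```python
# Toy illustration: a learned dynamics function behaves like a simulator that only
# hands back rewards and hidden states. A randomly initialized linear map stands in
# for a trained g; all sizes here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM, NUM_ACTIONS = 8, 4
W = rng.normal(size=(HIDDEN_DIM + NUM_ACTIONS, HIDDEN_DIM))  # stand-in "dynamics" weights
w_r = rng.normal(size=HIDDEN_DIM)                            # stand-in "reward head"


def g(state, action):
    """(s^{k-1}, a^k) -> (r^k, s^k), with the action one-hot encoded."""
    a = np.eye(NUM_ACTIONS)[action]
    next_state = np.tanh(np.concatenate([state, a]) @ W)
    reward = float(next_state @ w_r)
    return reward, next_state


state = rng.normal(size=HIDDEN_DIM)   # pretend this came from the representation function
for action in [0, 3, 1, 2]:           # any hypothetical plan, no environment involved
    reward, state = g(state, action)
    print(f"action {action}: predicted reward {reward:+.3f}")
```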

Value prediction networks [28] are perhaps the closest precursor to MuZero; they learn an MDP model grounded in real actions; the MDP is trained such that the cumulative sum of rewards, conditioned on the actual sequence of actions, matches the real environment. Unlike MuZero, there is no policy prediction and the search only utilizes value prediction.

MuZero Algorithm

K = 5 steps

1 million mini-batches of size 2048 in board games and 1024 in Atari

800 simulations for each search in board games, and 50 simulations for each search in Atari

Uses the same residual and convolutional architecture as AlphaZero

Dynamics function uses the same architecture as the representation function, and the prediction function uses the same architecture as AlphaZero

256 hidden planes

Uses 200 million frames of experience per game.
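
Gathering the notes above into one place, as a made-up config dict (the names are mine; the values are just the numbers listed above):

```python
# The hyperparameters from the notes above, collected into one (made-up) config dict.
MUZERO_CONFIG = {
    "num_unroll_steps": 5,             # K = 5 hypothetical steps per training sample
    "training_steps": 1_000_000,       # 1 million mini-batches
    "batch_size_board_games": 2048,
    "batch_size_atari": 1024,
    "simulations_per_search_board_games": 800,
    "simulations_per_search_atari": 50,
    "hidden_planes": 256,              # channels in the residual blocks
    "frames_per_game": 200_000_000,    # 200 million frames of experience
}
```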

Appendix B: search

Appendix D: data generation

Making our own!

Let's try to make a mini-version which can play Pong :)



  1. value prediction (starter sketch below)
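
Starting with item 1, here is a hypothetical first step: collect Pong frames with a random policy and fit a small value network to the observed discounted returns. It assumes gymnasium with the Atari extras (ale-py plus the ROMs) and PyTorch are installed; the environment id, the preprocessing, and all the network sizes are my placeholders, not anything from the paper.

```python
# A hypothetical first step toward the mini version: collect Pong frames with a
# random policy and fit a small value network to the observed discounted returns.
import ale_py
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

gym.register_envs(ale_py)  # needed on recent gymnasium versions to expose the ALE/... ids

GAMMA = 0.997      # discount for the return targets (the paper uses 0.997 on Atari)
DOWNSAMPLE = 4     # crude spatial downsampling to keep the toy network small


def preprocess(frame: np.ndarray) -> np.ndarray:
    """Grayscale and downsample a 210x160x3 Atari frame into a flat float vector."""
    gray = frame.mean(axis=2)                 # -> 210x160
    small = gray[::DOWNSAMPLE, ::DOWNSAMPLE]  # -> 53x40
    return (small / 255.0).astype(np.float32).ravel()


def rollout(env, max_steps=1000):
    """Play one episode with a random policy, returning frames and rewards."""
    frames, rewards = [], []
    obs, _ = env.reset()
    for _ in range(max_steps):
        frames.append(preprocess(obs))
        obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
        rewards.append(reward)
        if terminated or truncated:
            break
    return np.stack(frames), np.array(rewards, dtype=np.float32)


def discounted_returns(rewards: np.ndarray) -> np.ndarray:
    """z_t = u_{t+1} + gamma * u_{t+2} + ..., computed backwards over one episode."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + GAMMA * running
        returns[t] = running
    return returns


env = gym.make("ALE/Pong-v5")
frames, rewards = rollout(env)
targets = discounted_returns(rewards)

value_net = nn.Sequential(nn.Linear(frames.shape[1], 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

x = torch.from_numpy(frames)
z = torch.from_numpy(targets).unsqueeze(1)
for epoch in range(10):  # tiny fitting loop, just to watch the loss move
    loss = nn.functional.mse_loss(value_net(x), z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: value loss {loss.item():.4f}")
```

From here the next pieces would be the policy and reward heads and a search over the learned model, along the lines of the earlier sketches.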