GSoC article

Intro

This year I participated in Google Summer of Code with The Julia Language organization. I worked on an implementation of MuZero, DeepMind's algorithm, built on top of AlphaZero.jl.

The result of this work is the MuZero repo.

What is MuZero?

MuZero is a model-based reinforcement learning algorithm that learns a model of the environment during training. Like AlphaZero, MuZero uses Monte Carlo Tree Search (MCTS) during policy evaluation, but in contrast to AlphaZero it has three distinct networks:

  1. prediction (f) - takes a state and returns a policy and a value - exactly like AlphaZero's net
  2. dynamics (g) - takes a state and an action and returns the next state and a reward - MuZero learns the dynamics of the environment it plays, whereas AlphaZero is given them
  3. representation (h) - takes an observation (e.g. the game board) and returns a state in latent space - AlphaZero operates under the assumption that state = observation (in real game space)
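
To make the three roles concrete, here is a minimal, self-contained sketch with placeholder linear maps; the names and dimensions are made up for illustration and are not the repository's actual models:

softmax(x) = exp.(x) ./ sum(exp.(x))

latent_dim, n_actions, obs_dim = 8, 9, 9

Wh        = randn(latent_dim, obs_dim)
Wf_policy = randn(n_actions, latent_dim)
Wf_value  = randn(1, latent_dim)
Wg_state  = randn(latent_dim, latent_dim + n_actions)
Wg_reward = randn(1, latent_dim + n_actions)

# representation h: raw observation -> latent state
h(observation) = tanh.(Wh * observation)

# prediction f: latent state -> (policy, value), just like AlphaZero's network
f(state) = (softmax(Wf_policy * state), only(Wf_value * state))

# dynamics g: (latent state, action) -> (next latent state, reward)
function g(state, action)
    a = zeros(n_actions); a[action] = 1.0
    input = vcat(state, a)
    return tanh.(Wg_state * input), only(Wg_reward * input)
end

s = h(rand(9))     # encode a flattened 3x3 board
p, v = f(s)        # policy over 9 moves and a value estimate
s2, r = g(s, 5)    # imagine playing move 5 entirely in latent space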

The training loop consists of two parts:

  1. self-play, in which MuZero plays against itself and saves the game traces to memory
  2. learning, in which MuZero loads states from memory, analyzes them (computes the loss) and updates the networks' weights
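
In code, the loop alternates between the two phases roughly as below; the "game" and "learning step" here are trivial stand-ins for illustration, not the repository's actual functions:

memory = Vector{Vector{Int}}()

fake_self_play_game() = rand(1:9, 5)        # pretend trace: five random moves
fake_learning_step(batch) = length(batch)   # pretend loss computation / update

for iteration in 1:3
    # 1. self-play: play games and push their traces into memory
    for _ in 1:10
        push!(memory, fake_self_play_game())
    end
    # 2. learning: sample stored traces and "update" the networks
    batch = [rand(memory) for _ in 1:4]
    loss = fake_learning_step(batch)
    println("iteration $iteration: $(length(memory)) traces in memory, loss = $loss")
end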

So far it has only been tested on games, but perhaps in the future it could be used in complex environments whose rules are not known to humans (such as the real world).

If you haven't seen these yet:

MuZero: Mastering Go, chess, shogi and Atari without rules - DeepMind article

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model - Julian Schrittwieser YouTube video

Main motivations

The two main goals of this GSoC project were:

  • to explore MuZero (which is underexplored) and see how it differs from AlphaZero
  • to find out how much code MuZero and AlphaZero.jl could share

Exploration of MuZero

Even an initial exploration shows that MuZero differs from AlphaZero quite a bit.

Weak initial signal - because MuZero lives in its own latent world at the start, and it takes some time to align that world with the real one, MCTS is basically useless at the beginning: it builds a tree out of states that have nothing to do with reality. In contrast, AlphaZero's MCTS works on real observations and is useful right from the start. That lets AlphaZero generate a lot of samples first and then step into a longer learning phase to analyze them - a pipeline closer to supervised learning, which is more forgiving when tuning hyperparameters.[1]

Solved tic-tac-toe - it took longer than AlphaZero, which gets quite good results at the first iteration thanks to MCTS.

Challenging - even the flagship implementation muzero-general can't solve tic-tac-toe properly. It shows signs of learning, but it benchmarks against the equivalent of minimax with depth 1, so when tested against a human it doesn't play well #152.

Code sharing

MuZero and AlphaZero share some code, mostly higher-abstraction code (starting simulations, playing games, etc.) as well as MCTS. The MCTS implementation, however, was optimized for AlphaZero, where one state can be reached through different routes (leaves can have multiple parents), whereas MuZero operates on states that are outputs of neural networks, which means they will never be exactly the same. If MCTS were written from scratch specifically for MuZero, it would probably look different [see the what's next section of the coming technical blog post].

In the future they'll probably share more code, as MuZero served as a testing ground for things that will come to AlphaZero.jl (like using Base.Logging #55).

Debugging / tuning hypers

My GSoC experience started smoothly. I was ahead of schedule, and it seemed that MuZero would solve tic-tac-toe two weeks before the first evaluation period. But I underestimated the difficulty of debugging a reinforcement learning system. Andy Jones' phrase

“you might have a few hundred lines of code that you think are correct in an hour, and a system that’s actually correct two months later”

accurately describes my experience, as I spent most of GSoC finding bugs and tuning hyperparameters. These two things are closely related in reinforcement learning, and you never know whether your system's poor performance is due to one of the hyperparameters being way off or to a bug in the code.

Because debugging RL systems is so hard, I will tell my own debugging story below, highlighting some of the general principles and strategies that I found most helpful.

After the initial feature implementation, and after getting rid of compile-time bugs, I had something that I thought "should work" with tic-tac-toe, but the result looked like this:

[Figure: losses]

Losses (aside from L2 regularisation) didn't change. None of the metrics I had logged showed any sign of learning or pointed to a specific problem, and I didn't know where to look for bugs.

All I knew was that something was wrong, probably in the learning part. But in reinforcement learning systems everything is connected and it's hard to isolate one part. In order to disentangle the self-play component from the learning component, Jonathan suggested that we fill the memory buffer with high-quality samples generated using an existing tic-tac-toe solver. MuZero could then learn from that data like a supervised learning system.
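
The idea, in isolation, looks roughly like the standalone sketch below (this is not the repository's code; the brute-force negamax solver and the flat board encoding are just stand-ins):

const LINES = [(1,2,3), (4,5,6), (7,8,9), (1,4,7), (2,5,8), (3,6,9), (1,5,9), (3,5,7)]

function winner(board)
    for (i, j, k) in LINES
        board[i] != 0 && board[i] == board[j] == board[k] && return board[i]
    end
    return nothing
end

# Negamax solver: returns (value for the player to move, a best action).
function solve(board, player)
    w = winner(board)
    w !== nothing && return (w == player ? 1 : -1, 0)
    moves = findall(==(0), board)
    isempty(moves) && return (0, 0)                     # draw
    best_value, best_action = -2, 0
    for a in moves
        next = copy(board); next[a] = player
        value = -first(solve(next, -player))
        if value > best_value
            best_value, best_action = value, a
        end
    end
    return (best_value, best_action)
end

# Fill a "memory" with (board, value, best move) targets for the learning phase.
# (A real run would cover many positions, not just the empty board.)
memory = [(zeros(Int, 9), solve(zeros(Int, 9), 1)...)]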

Focusing on the learning part, I disabled parts of the system to narrow down where something could go wrong. This included things like disabling the g network in the loss computation (unrolling for 0 steps) and substituting h and g so that they behave like AlphaZero. After taking a break, I ran learning steps with only one state in memory and the loss started to decrease, but as soon as traces had more states nothing worked again. The investigation finally found that there was:

traces = [sample_trace(memory) for _ in 1:hyper.loss_computation_batch_size]
trace_pos_idxs = (sample_position(t) for t in traces)
samples = (make_target(gspec, t, i, hyper) for (t,i) in zip(traces, trace_pos_idxs))

where there should have been:

traces = [sample_trace(memory) for _ in 1:hyper.loss_computation_batch_size]
trace_pos_idxs = [sample_position(t) for t in traces]
samples = (make_target(gspec, t, i, hyper) for (t,i) in zip(traces, trace_pos_idxs))

(Did you notice the brackets?) sample_position(t) chooses a state index randomly, and with a generator (round brackets) a new index was selected every time samples was iterated over.
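
The difference is easy to reproduce in isolation; this standalone snippet (not code from the project) shows why the round brackets made the targets drift:

gen = (rand(1:10) for _ in 1:3)    # generator: rand runs again on every iteration
collect(gen) == collect(gen)       # almost always false - new values on each pass

arr = [rand(1:10) for _ in 1:3]    # array comprehension: rand runs exactly once
collect(arr) == collect(arr)       # always true - the values are fixed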

Then the loss started to decrease properly, and MuZero started learning something, albeit extremely slowly.

Following the advice from Andy Jones' article, to track down bugs I created the simplest possible probe environments, like

  • two actions, whoever chooses 1 wins,
  • a line with n fields, whoever gets two adjacent ones wins

to test whether MuZero could handle them. It solved them really well, so I thought that the problem had to be only with the hyperparameters. But then something came to my mind: I swapped the starting player from 'white' to 'black', and MuZero started to behave strangely. It turned out there was a problem with swapping the perspective of the reward: MuZero's dynamics function returns the reward from the perspective of the current player, while MCTS asks for the reward as seen by the white player. I fixed that bug in the wrapper and added comments that make it clear what's going on in the code, which will hopefully prevent that kind of bug in the future.
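
The gist of that conversion, stripped of all context (the names are illustrative, not the wrapper's actual code):

# The dynamics network reports the reward relative to some player's perspective,
# while MCTS wants everything expressed from white's point of view.
reward_as_white(reward, perspective_is_white::Bool) =
    perspective_is_white ? reward : -reward

reward_as_white(1.0, true)    #  1.0 - already from white's perspective, unchanged
reward_as_white(1.0, false)   # -1.0 - a win for black is a loss seen from white's side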

At that point MuZero handled the probe envs well, so there probably weren't any more bugs, but it still struggled with tic-tac-toe. I was looking for a game simpler than TTT but more advanced than these probe envs. Then I realised that TTT with some initial beads already on the board is simpler than the whole game.

The plan that Jonathan and I discussed at our meeting was to behave like a scientist and:

  1. borrow hyperparameters from muzero-general,
  2. create an initial state with n beads on the board (sketched after this list),
  3. (optional) generate (or load previously generated) samples with a minimax / MCTS-rollout player and pretrain MuZero on them, OR use the networks from a simpler initial state as the starting point [that could speed up learning by mitigating the weak initial signal],
  4. change one parameter and write down a hypothesis about the expected outcome (how the curves should change),
  5. train MuZero normally (with self-play),
  6. examine whether the hypothesis was correct,
  7. if the game is solved, start over with fewer beads on the board (n -= 1) [go to 1]; otherwise start the experiment again, changing a different hyperparameter, or the same one by a different amount [go to 2/3].
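
A hypothetical helper for step 2 (illustrative only, not the repository's code) could simply play n random legal moves to obtain a simpler starting position:

function random_start_position(n::Integer)
    board = zeros(Int, 9)              # 0 = empty, 1 = white, -1 = black
    player = 1
    for _ in 1:n
        empty_cells = findall(==(0), board)
        board[rand(empty_cells)] = player
        player = -player
    end
    return board                       # a real version would skip already-won positions
end

random_start_position(4)   # two moves by each player already on the board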

With that method MuZero achieved good results quite quickly, as solving easier problems is generally faster (fewer possible actions and shorter games) and less sensitive to hyperparameters. Also, hyperparameters that are good for solving the game from state no. (n+1) are often similar to those that solve it from state no. (n). And the network from the easier variant is quite a good starting point for the harder one.

Aside from isolating the learning part, starting from supervised samples helps to overcome the initial misalignment between latent states and observations - it creates a good starting point for the networks. It can also be useful for checking whether the network is too small (has too few trainable parameters), as it should be able to overfit to a small number of samples.

That is partly how I observed (the rest was intuition) that the default neural network width may be too small to solve the whole game. With a wider network, and the rest of the parameters being quite good, MuZero solved tic-tac-toe.

[Figure: benchmarks]

For a detailed explanation of what to look at and how the curves should behave, I recommend Andy Jones' great article Debugging RL, Without the Agonizing Pain.

Summing up:

  • start with a simpler variant of the problem
  • use good samples for supervised learning
  • tackle parameter tuning like a scientist (or like a rich man, if you've got half the world's computing power to run a grid search)

Personal journey

I really enjoyed my summer of coding. It was a great experience. Meetings with Jonathan shone a light on things to try next, especially when I hit a dead end. I learned a lot during this summer, especially about Julia and open-source workflows. I would recommend participating in GSoC to anyone who loves the idea of free code and is motivated to do a cool project.

Julia section

During profiling (which currently gives correct results only with -t 1, i.e. a single thread) I discovered strange hash performance behaviour #41880.

I also discovered that TensorBoardLogger gets stuck when messages are logged from multiple threads #106.

Due to frequent "CUDNN_STATUS_EXECUTION_FAILED (code 8)" errors (like here #57), GPU support is currently experimental and MuZero crashes at random moments.

Other things

During GSoC I did some other loosely related things, such as:

  • added some functionality to AlphaZero.jl: support for the mancala game #42 #44
  • created a wrapper for OpenSpiel.jl #68

We also discussed integration with ReinforcementLearning.jl - this was left for future investigation, since RL.jl would first need to develop quite a few things. There was also a discussion about distributed RL - although, due to my lack of experience, I participated only as a listener, I got some valuable insights and broadened my horizons.

Acknowledgments

  • Jonathan Laurent (main mentor) for creating AlphaZero.jl - the backbone of this implementation, for valuable insights at our meetings and, overall, for being the best kind of mentor

  • Jun Tian (second mentor) for ReinforcementLearning.jl, for valuable discussions about it and about distributed computing, and for connecting me with Jonathan

  • the JuliaGPU team for access to cyclops - a GPU cluster

  • Werner Duvaud and the other creators of the muzero-general package, which was a source of inspiration

  1. The correlations between real observations and latent states haven't been examined yet (to my knowledge), and there may be some interesting results hidden there. Especially in stochastic environments - will MuZero create some kind of superposition?