What?
Sampled MuZero, a MuZero extension for continuous and (large) discrete action spaces.
Why?
It would be cool to have a single algorithm that works with arbitrary action spaces.
How?
*(Figure from the original paper.)*
- The main idea is to compute an improved policy over a small subset of action space elements.
- Operator view of Policy Improvement:
- Policy improvement is a composition of improvement $\mathcal{I}$ and projection $\mathcal{P}$ (since the improved policy might be unrealisable) → PI = $\mathcal{P} \circ \mathcal{I}$.
- Policy Gradient:
- $\mathcal{I}\pi(s,a) \propto \pi(s,a)Q(s,a)$
- PPO
- $\mathcal{I}\pi(s,a) \propto \exp(Q(s,a)/\tau)$
- MPO
- $\mathcal{I}\pi(s,a) \propto \pi(s,a)\exp(Q(s,a)/\tau)$
- Sample-based action-independent policy improvement operator:
- $\hat{\mathcal{I}}_\beta\pi(a\mid s) = (\hat{\beta}/\beta)(a\mid s)\, f(s,a,\hat{\mathcal{Z}}_\beta(s))$, where $f$ is called an action-independent operator ($\mathcal{I}\pi(a\mid s) = f(s,a,\mathcal{Z}(s))$).
- $\hat{\mathcal{Z}}_\beta(s)$ is a normalisation factor.
- The authors prove that
- $\mathbb{E}_{a\sim \mathcal{I}\pi}[X\mid s] = \lim_{K\rightarrow \infty}\sum_{a \in \mathcal{A}}\hat{\mathcal{I}}_\beta\pi(a\mid s)X(s,a)$
- As a corollary, the sample-based PI operator converges in distribution to the true PI operator:
- $\lim_{K\rightarrow \infty}\hat{\mathcal{I}}_\beta\pi = \mathcal{I}\pi$,
- which is cool since we can now subsample the actions and still do policy improvement (see the numerical sketch after this list).
- Sampled MuZero
- Naive
- Do MCTS over the sampled actions $\{a_i\}$ and use PUCT (as usual in AlphaZero/MuZero) with the policy $\pi$ as prior.
- Do the correction for search as $\hat{\mathcal{I}}_\beta\pi = (\hat{\beta}/\beta) f$.
- Smarter
- Do PUCT with a modified policy $(\hat{\beta}/\beta\,\pi)(a\mid s)$ instead of $\pi$.
- The rest of MuZero can be left unchanged!
- The projection part is done by minimising the KL divergence $\text{KL}(\hat{\mathcal{I}}_\beta\pi \,\|\, \pi_\theta)$.
- For node expansion, we only sample a subset of actions instead of expanding all of them.
- HOW DO WE SAMPLE (how do we choose $\beta$)?
- Uniform?
- $\beta = \pi$, with Dirichlet noise for exploration (applied before PUCT).
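To make the correction concrete, here is a minimal numerical sketch (not the authors' code): it assumes $\beta = \pi$, an MPO-style operator $f(s,a,\mathcal{Z}) = \pi(a\mid s)\exp(Q(s,a)/\tau)/\mathcal{Z}$, and made-up $\pi$ and $Q$ on a toy discrete action space. It builds the empirical $\hat{\beta}$, the sample-based improved policy $\hat{\mathcal{I}}_\beta\pi$, the $(\hat{\beta}/\beta)\pi$ prior of the "smarter" variant, and checks the corollary by comparing $\sum_a \hat{\mathcal{I}}_\beta\pi(a\mid s)X(s,a)$ (with $X = Q$) against the exact expectation as $K$ grows.

```python
# Toy sketch of the sample-based policy improvement operator (not the authors'
# implementation). Assumptions: beta = pi, MPO-style operator
# f(s, a, Z) = pi(a|s) * exp(Q(s,a) / tau) / Z, random pi and Q.
import numpy as np

rng = np.random.default_rng(0)
num_actions, tau = 1000, 1.0

# Toy prior pi(.|s) and action values Q(s, .) for a single state s.
logits = rng.normal(size=num_actions)
pi = np.exp(logits) / np.exp(logits).sum()
Q = rng.normal(size=num_actions)


def exact_improvement(pi, Q, tau):
    """Exact MPO-style improvement: I_pi(a|s) ∝ pi(a|s) * exp(Q(s,a) / tau)."""
    w = pi * np.exp(Q / tau)
    return w / w.sum()


def sampled_improvement(pi, Q, tau, K, rng):
    """Sample-based operator: (beta_hat / beta)(a|s) * f(s, a, Z_hat_beta(s))."""
    actions = rng.choice(len(pi), size=K, p=pi)              # a_1..a_K ~ beta = pi
    beta_hat = np.bincount(actions, minlength=len(pi)) / K   # empirical beta_hat
    w = pi * np.exp(Q / tau)                                  # unnormalised f
    z_hat = (beta_hat / pi * w).sum()                         # normaliser over the sampled subset
    return (beta_hat / pi) * w / z_hat, beta_hat


exact = exact_improvement(pi, Q, tau)
for K in (10, 100, 1_000, 100_000):
    improved, beta_hat = sampled_improvement(pi, Q, tau, K, rng)
    # Corollary in action: sum_a Ihat_beta pi(a|s) X(s,a) -> E_{a ~ I_pi}[X | s],
    # here with X = Q.
    print(K, float(improved @ Q), float(exact @ Q))

# Prior for the "smarter" PUCT variant: (beta_hat / beta) * pi, which reduces
# to beta_hat when beta = pi; the search then only visits the sampled actions.
puct_prior = beta_hat / pi * pi
```

As $K$ grows, the two printed expectations should get closer, which is exactly the corollary above; note that with $\beta = \pi$ the PUCT prior collapses to the empirical distribution of the sampled actions.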
And?
- I find the paper interesting, but I seem to be missing the following point: how is using $\beta=\pi$ different from sampling from a policy with a low temperature? In the latter case, we end up having sampled the modes → the same as sampling the modes first and then resampling actions from them.
- In the first theorem, the authors introduce a random variable $X$. What is it in our setting? Is it the reward?
- It is surprising how few actions we need to sample with $\beta$ to get good performance. However, I believe this says more about the benchmarks used than about a property of the algorithm.
- I know that it's popular to discretise continuous action spaces, but this sounds like a poor substitute for a benchmark with a genuinely large discrete action space. In the latter, the semantics of the actions can be very different; with discretised continuous actions, neighbouring actions are similar, so even a subsampled set is unlikely to miss anything important because the sampled actions can stand in for the missing ones.
- I got lost here:
- "Sparse sampling algorithms (Kearns et al., 1999) are an effective approach to planning in large state spaces. The main idea is to sample K possible state transitions from each state, drawn from a generative model of the underlying MDP... In contrast, our method samples actions rather than state transitions... however, we focus in this paper upon the novel aspect relating to large action spaces and utilise deterministic transition model"
- What is the difference between sampling from the state space and sampling actions when the transitions are deterministic? Isn't there a one-to-one correspondence between $s'$ and taking action $a$ in state $s$?
- "...to estimate the gradients for the reverse KL divergence between the neural network policy estimate and the target MCTS policy, demonstrating some learning on the 1D pendulum task." 🤣
This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.