What?
A study of zero-shot generalisation to new environments with the same dynamics, but different observational features.
Why?
An RL agent can overfit in many ways, e.g. to the dynamics, by exploiting determinism, or by avoiding observations entirely. Varying only the observation function while keeping the rest of the MDP fixed provides a convenient way to study one type of overfitting: observational overfitting.
How?
- Fun stuff:
- Showing the agent only the scoreboard as its observation still leads to a policy achieving around 5k reward.
- Blacking out the scoreboard increases test performance by 10%.
- Saliency maps also highlight the background, most likely because moving background objects become associated with game progress.
- Generalisation is an underconstrained problem: there are many policies that solve the training tasks, and the one found might not be suitable for the test tasks.
- Though one could hope that implicit regularisation might help.
- Problem setting:
- We have a distribution of MDPs.
- We sample from it to get the train set.
- The generalisation gap is the difference between performance on the train set and performance on the whole distribution (measured as in the small sketch at the end of this section).
- We consider a distribution of parameterized MDPs (or, more precisely, POMDPs), where the parameter $\theta$ specifies the observation function only, the rest of the MDP stays the same.
- The authors consider an $(f,g)$-scheme:
- There are useful $f$ and not-useful $g$ features in the observations.
- The final observation function is a function of the two above: $\phi_\theta(s) = h(f(s), g_\theta(s))$
- As an example consider concatenating important game observational features (such as monsters) with a background that has no effect on dynamics.
- The task id defines the seed used to sample the noise (see the sketch after this problem-setting block).
- regularisation
- explicit
- easy: just penalise the policy for looking at $g_\theta$;
- implicit
- much more interesting;
- overparametrization;
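A minimal sketch of how I read the $(f,g)$-scheme, with concatenation as $h$, the identity as $f$, and a task-seeded random projection as $g_\theta$; all names, shapes, and the choice of projection are illustrative, not the paper's code:

```python
import numpy as np

def make_observation_fn(task_seed, d_state, d_noise=64):
    """phi_theta(s) = h(f(s), g_theta(s)) with h = concatenation.
    f keeps the useful state features; g_theta is a fixed, task-seeded
    random projection producing irrelevant features."""
    rng = np.random.RandomState(task_seed)   # task id -> seed of the noise
    W_theta = rng.randn(d_noise, d_state)    # fixed for this task

    def phi(state):
        f_s = state                          # useful features f(s)
        g_s = W_theta @ state                # irrelevant features g_theta(s)
        return np.concatenate([f_s, g_s])    # h(f(s), g_theta(s))

    return phi

# Train and test tasks share dynamics and reward; only theta (the seed) differs:
train_obs_fns = [make_observation_fn(seed, d_state=4) for seed in range(10)]
test_obs_fns = [make_observation_fn(seed, d_state=4) for seed in range(10, 20)]
```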
- Experiments:
- LQR
- An analog to analyzing linear/logistic regression in supervised learning.
- All of the minima are global minima.
- Observation includes the important state features as well as noise projected to a much higher dimension than the signal: $o_t = [W_c; W_\theta]\, s_t$, where $W_c \in \mathbb{R}^{d_\text{signal}\times d_\text{state}}$, $W_\theta \in \mathbb{R}^{d_\text{noise}\times d_\text{state}}$, and $d_\text{signal} \ll d_\text{noise}$ (a small sketch follows the LQR bullets).
- 1-step LQR is not able to remove the noise component and overfitting occurs.
- In 2-step LQR, the gap grows as $\mathcal{O}(\sqrt{d_\text{noise}})$.
- Overparametrization (increasing both width and depth) reduces the generalisation gap (and also reduces the norm of the final policy), i.e. it biases the policy towards simpler models.
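How I picture the LQR observation model; the concrete dimensions and the plain Gaussian projections are my assumptions, not the paper's exact construction:

```python
import numpy as np

d_state, d_signal, d_noise = 10, 10, 1000   # d_signal << d_noise (values illustrative)

W_c = np.random.randn(d_signal, d_state)    # signal projection, shared across tasks

def make_task_observation_matrix(task_seed):
    """Per-task observation matrix, so that o_t = [W_c; W_theta] s_t."""
    rng = np.random.RandomState(task_seed)
    W_theta = rng.randn(d_noise, d_state)    # task-specific noise projection
    return np.vstack([W_c, W_theta])

# A linear policy a_t = K o_t mostly sees noise dimensions. Many different K
# reach the optimal cost on the training tasks (any K with K [W_c; W_theta]
# equal to the optimal state-feedback gain), which is why the training
# problem is underconstrained and overfitting to W_theta is possible.
obs_matrix = make_task_observation_matrix(task_seed=0)
s_t = np.random.randn(d_state)
o_t = obs_matrix @ s_t                      # observation of dimension d_signal + d_noise
```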
- Projected Gym
- Similar setting to the above, but using PPO (a wrapper sketch follows this block).
- Generalisation gap depends on the task (dynamics)
- Overparametrization might help
- Also might hurt!
- e.g. tanh → vanishing grads.
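A rough sketch of the projected-Gym setup as an observation wrapper around a standard Gym environment; the wrapper class and all shapes are mine, not the authors' code:

```python
import numpy as np
import gym

class ProjectedObsWrapper(gym.ObservationWrapper):
    """Keeps the wrapped env's dynamics but replaces the observation with
    [W_c s; W_theta s], where W_theta is fixed by a task seed."""

    def __init__(self, env, task_seed, d_noise=512):
        super().__init__(env)
        d_state = env.observation_space.shape[0]
        self.W_c = np.eye(d_state)                   # keep the true state as the signal part
        rng = np.random.RandomState(task_seed)
        self.W_theta = rng.randn(d_noise, d_state)   # task-specific noise projection
        high = np.full(d_state + d_noise, np.inf, dtype=np.float32)
        self.observation_space = gym.spaces.Box(-high, high, dtype=np.float32)

    def observation(self, obs):
        return np.concatenate([self.W_c @ obs, self.W_theta @ obs]).astype(np.float32)

# Train PPO on a handful of seeds and evaluate on held-out ones, e.g.:
# train_envs = [ProjectedObsWrapper(gym.make("CartPole-v1"), task_seed=s) for s in range(8)]
# test_envs  = [ProjectedObsWrapper(gym.make("CartPole-v1"), task_seed=s) for s in range(100, 108)]
```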
- Deconvolutional projections
- Nice idea to avoid the complexities of learning from pixels while still using CNN architectures.
- Take the important features and do deconvolutions.
- Combine with noise as in the previous setups (sketched after this block).
- Different architectures have different generalisation properties.
- NatureCNN cannot solve CartPole
- Interestingly, the generalisation gap ranking is similar across the tasks, i.e. NatureCNN is always below IMPALA.
- Showing only the noise to the agent tests its memorisation capabilities.
- NatureCNN can remember 30 levels, but not 50
- IMPALA is really bad at memorisation.
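How I picture the deconvolutional projection: the low-dimensional features are "rendered" into an image via transposed convolutions so that pixel-based CNN policies (NatureCNN, IMPALA) apply, and a frozen, task-seeded deconv adds the nuisance part. The architecture below is purely illustrative:

```python
import torch
import torch.nn as nn

def deconv_stack(d_in, channels=3):
    """Two transposed convolutions that blow a 1x1 feature map up to a small image."""
    return nn.Sequential(
        nn.ConvTranspose2d(d_in, 32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.ConvTranspose2d(32, channels, kernel_size=8, stride=4),
    )

class DeconvObservation(nn.Module):
    """Observation function (not trained): renders the useful features into an image
    with a shared deconv and adds a noise image from a per-task-seeded deconv."""

    def __init__(self, d_state, task_seed):
        super().__init__()
        torch.manual_seed(0)                      # signal generator shared across tasks
        self.signal = deconv_stack(d_state)
        torch.manual_seed(task_seed)              # task id fixes the noise generator
        self.noise = deconv_stack(d_state)
        for p in self.parameters():               # the whole observation function is frozen
            p.requires_grad_(False)

    def forward(self, state):                     # state: (batch, d_state)
        x = state.view(state.shape[0], -1, 1, 1)  # treat the state as a 1x1 feature map
        return self.signal(x) + self.noise(x)     # image fed to the CNN policy
```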
- CoinRun
- The authors study a whole bunch of architectures to measure the generalisation gap.
- Inception for the win.
- IMPALA LARGE BN is close to the Inception.
- Overparametrization helps generalisation in CoinRun.
- Predicting the generalisation gap from the training phase is hard, still sci-fi territory in RL.
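For all of the experiments above, the quantity being compared is the generalisation gap from the problem setting; a trivial sketch of how I would measure it, with `evaluate_return` standing in for rolling out the trained policy on the environment built from a given seed:

```python
import numpy as np

def generalisation_gap(evaluate_return, train_seeds, test_seeds):
    """Average return on the training tasks minus average return on held-out
    tasks from the same MDP distribution (same dynamics, new theta)."""
    train_return = np.mean([evaluate_return(s) for s in train_seeds])
    test_return = np.mean([evaluate_return(s) for s in test_seeds])
    return train_return - test_return
```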
And?
- The paper is as cool as it is confusing. For my taste, it tries to cram a lot of stuff into the conference-paper format, and that hurts it. I wish some of the background concepts were explained in more detail and with more intuition (e.g. margin distributions). The paper assumes a lot of context that I do not possess, and this makes reading it hard.
- The NatureCNN vs IMPALA architecture difference on CartPole is just surreal.
- From my experience, CNN architectures used in RL are tiny compared to what is used in computer vision. It was interesting to see how the choice of architecture might change performance in the deep RL setting.