What?

A study of zero-shot generalisation to new environments with the same dynamics, but different observational features.

Why?

An RL agent can overfit in many ways: to the dynamics, by exploiting determinism, or by avoiding observations altogether. Varying only the observation function while keeping the rest of the MDP fixed provides a convenient way to study one specific type of overfitting: observational overfitting.
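To make the setup concrete, here is a minimal sketch (my own illustration, not the paper's code, assuming a classic gym-style interface): each environment variant shares the same underlying MDP but appends its own seed-specific projection of the state to the observation, so only the observation function differs across variants.

```python
import numpy as np
import gym


class ObservationVariant(gym.ObservationWrapper):
    """Shares the wrapped MDP's dynamics and reward, but changes what is observed."""

    def __init__(self, env, noise_dim=16, seed=0):
        super().__init__(env)
        rng = np.random.default_rng(seed)
        state_dim = env.observation_space.shape[0]
        # Environment-specific projection: differs per seed, fixed within an env.
        self._proj = rng.normal(size=(noise_dim, state_dim))
        high = np.inf * np.ones(state_dim + noise_dim, dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-high, high=high, dtype=np.float32)

    def observation(self, obs):
        # Dynamics are untouched; only the observed features change across variants.
        distractor = self._proj @ obs
        return np.concatenate([obs, distractor]).astype(np.float32)


# Train on a few variants, then evaluate zero-shot on a held-out variant.
train_envs = [ObservationVariant(gym.make("CartPole-v1"), seed=s) for s in range(3)]
test_env = ObservationVariant(gym.make("CartPole-v1"), seed=100)
```

An agent that latches onto the seed-specific distractor features will do well on the training variants but fail zero-shot on the held-out one, which is the failure mode the paper isolates.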

How?

And?


This note is part of my paper notes series. You can find more here or on Twitter. I also have a blog.