
Decouple the policy and the value function in actor-critic methods, and keep the policy from learning spurious correlations that hinder generalisation.


With a rich visual observation space, policies tend to pick up spurious correlations, and their generalisation deteriorates. Sharing a network between the policy and the value function can make things worse, since estimating the value may need instance-specific features that the policy itself does not need. This paper investigates how decoupling the two can help.




  1. Decouple policy from the value function. (DAAC)

  2. Discourage the policy's feature extractor from overfitting to a particular problem instance. (IDAAC)
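
To make the first point concrete, here is a minimal sketch of why sharing matters. It is a toy linear model in plain Python, not the paper's architecture: the class `LinearActorCritic`, its `shared` flag, and all dimensions are hypothetical names chosen for illustration. With a shared encoder, a gradient step on the value loss also moves the features the policy reads, so the policy logits change even though no policy loss was applied. With decoupled encoders, the policy is untouched.

```python
import random


def matvec(matrix, vec):
    """Multiply a list-of-lists matrix by a list vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class LinearActorCritic:
    """Toy linear actor-critic (hypothetical, for illustration only).

    If `shared` is True, the value head reads the same encoder as the policy.
    If False, the value function gets its own encoder (the decoupled case).
    """

    def __init__(self, obs_dim, feat_dim, n_actions, shared, seed=0):
        rng = random.Random(seed)
        self.policy_enc = rand_matrix(feat_dim, obs_dim, rng)
        self.policy_head = rand_matrix(n_actions, feat_dim, rng)
        # Shared case: both names point at the same parameter object.
        self.value_enc = (
            self.policy_enc if shared else rand_matrix(feat_dim, obs_dim, rng)
        )
        self.value_head = [rng.uniform(-1.0, 1.0) for _ in range(feat_dim)]

    def logits(self, obs):
        return matvec(self.policy_head, matvec(self.policy_enc, obs))

    def value(self, obs):
        feats = matvec(self.value_enc, obs)
        return sum(u * z for u, z in zip(self.value_head, feats))

    def value_step(self, obs, target, lr=0.1):
        """One SGD step on the value loss 0.5 * (V(obs) - target)^2 only."""
        feats = matvec(self.value_enc, obs)
        err = self.value(obs) - target
        old_head = list(self.value_head)
        for i in range(len(self.value_head)):
            self.value_head[i] -= lr * err * feats[i]
            for j in range(len(obs)):
                self.value_enc[i][j] -= lr * err * old_head[i] * obs[j]


def logits_change_after_value_step(shared):
    """Return (logits before, logits after) a single value-only update."""
    model = LinearActorCritic(obs_dim=3, feat_dim=4, n_actions=2, shared=shared)
    obs = [1.0, -0.5, 0.25]
    before = model.logits(obs)
    model.value_step(obs, target=10.0)
    after = model.logits(obs)
    return before, after
```

Calling `logits_change_after_value_step(shared=True)` returns two different logit vectors, while `shared=False` returns identical ones. In a deep network the same coupling happens through the shared feature extractor, which is the interference that decoupling is meant to remove.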

The setting is as follows: we have a finite number of tasks sampled from a task distribution and want to generalise to new tasks from that distribution (e.g. Procgen). Overfitting to features specific to a particular instance is harmful, because those features do not carry over to unseen instances.
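
One way to write this setting down, as a sketch: the symbols $p(m)$ for the task distribution, $m_i$ for the training instances, $J_m(\pi)$ for the expected return of policy $\pi$ on instance $m$, and $\gamma$ for the discount factor are my notation, not necessarily the paper's.

```latex
% Training: only n sampled instances are available.
m_1, \dots, m_n \sim p(m), \qquad
J_{\text{train}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} J_{m_i}(\pi), \qquad
J_m(\pi) = \mathbb{E}_{\pi, m}\Big[ \sum_{t \ge 0} \gamma^t r_t \Big]

% Evaluation: expected return over the full task distribution.
J_{\text{test}}(\pi) = \mathbb{E}_{m \sim p(m)} \big[ J_m(\pi) \big]

% Generalisation gap: large when the policy relies on instance-specific features.
\text{gap}(\pi) = J_{\text{train}}(\pi) - J_{\text{test}}(\pi)
```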


This note is a part of my paper notes series.