What?

Decoupling the policy and the value function in actor-critic methods, so that the policy does not pick up spurious correlations that hurt generalisation.

Why?

A rich visual observation space leads to policies picking up spurious correlations, which hurts generalisation. Sharing a network between the policy and the value function can make this worse. This paper investigates how decoupling the two helps.

How?

TLDR:

  1. Decouple the policy from the value function. (DAAC; a rough sketch follows right after this list.)

  2. Prevent the policy's feature extractor from overfitting to a particular problem instance. (IDAAC; sketched a bit further down.)
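
To make point 1 concrete, here is a minimal PyTorch-style sketch of the decoupled setup as I read it (not the authors' code): the policy network has its own encoder, a policy head, and an auxiliary advantage head, while the value function is a completely separate network. The names (`PolicyNet`, `ValueNet`, `daac_losses`), the MLP encoders, and the coefficients are placeholders; the paper works with a convolutional encoder on Procgen pixels and PPO with GAE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class PolicyNet(nn.Module):
    """Policy network: its own encoder, a policy head, and an auxiliary
    advantage head A(s, a). The value function never touches this encoder."""

    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.adv_head = nn.Linear(hidden + n_actions, 1)

    def forward(self, obs):
        z = self.encoder(obs)
        return Categorical(logits=self.policy_head(z)), z

    def advantage(self, z, actions_onehot):
        return self.adv_head(torch.cat([z, actions_onehot], dim=-1)).squeeze(-1)


class ValueNet(nn.Module):
    """Separate critic with its own encoder, optimised independently."""

    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


def daac_losses(policy_net, value_net, obs, actions, old_log_probs,
                adv_targets, returns, clip_eps=0.2, adv_coef=0.25, ent_coef=0.01):
    """PPO-style clipped loss + advantage regression for the policy network,
    and a plain regression loss for the separate value network."""
    dist, z = policy_net(obs)

    # Standard PPO clipped surrogate on the (e.g. GAE) advantage targets.
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    pg_loss = -torch.min(ratio * adv_targets, clipped * adv_targets).mean()

    # Auxiliary advantage regression: the only "value-like" signal the
    # policy encoder ever sees.
    actions_onehot = F.one_hot(actions, dist.logits.shape[-1]).float()
    adv_loss = (policy_net.advantage(z, actions_onehot) - adv_targets).pow(2).mean()

    policy_objective = pg_loss + adv_coef * adv_loss - ent_coef * dist.entropy().mean()

    # The critic gets its own optimiser; its gradients never shape the policy's features.
    value_loss = (value_net(obs) - returns).pow(2).mean()
    return policy_objective, value_loss
```

The design point is simply that value-function gradients never reach the policy's encoder; the advantage head gives the policy a useful learning signal without the instance-specific information a value estimate tends to carry.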

We are in a setting where we have a finite number of samples from the task distribution and want to generalise to unseen tasks (e.g. Procgen). Overfitting to features specific to a particular instance is clearly harmful, since those features will not transfer to new instances.
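
For point 2, IDAAC additionally regularises the policy encoder so its features cannot reveal episode-specific information, via an adversarial game: a small discriminator tries to tell which of two observations from the same episode came first, while the encoder is trained so the discriminator cannot beat chance. A rough sketch of that regulariser, under my reading of the paper (helper names, the coefficient, and the MLP sizes are my placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OrderDiscriminator(nn.Module):
    """Given policy features of two observations from the same episode,
    predict which observation came first (class 1 = first argument came first)."""

    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, z_a, z_b):
        return self.net(torch.cat([z_a, z_b], dim=-1))


def idaac_invariance_losses(discriminator, z_i, z_j, i_before_j, inv_coef=0.002):
    """Adversarial invariance regulariser (sketch).

    z_i, z_j   : policy-encoder features of two observations from one episode
    i_before_j : bool tensor, True where observation i occurred before j
    Returns the discriminator's loss and the term added to the policy objective.
    """
    labels = i_before_j.long()

    # Discriminator step: learn the temporal order from *detached* features,
    # so this gradient never reaches the policy encoder.
    disc_loss = F.cross_entropy(discriminator(z_i.detach(), z_j.detach()), labels)

    # Encoder step: push the discriminator toward a 50/50 prediction, i.e. make
    # the order (and hence instance-specific progress cues) unrecoverable.
    # Apply this term only to the policy/encoder parameters when optimising.
    log_probs = F.log_softmax(discriminator(z_i, z_j), dim=-1)
    enc_loss = -inv_coef * log_probs.mean()

    return disc_loss, enc_loss
```

In use, one would sample pairs of observations from the same episode, encode them with the policy encoder above, and alternate the two updates GAN-style: `disc_loss` trains only the discriminator, while `enc_loss` is added (with a small coefficient) to the DAAC policy objective so the features become order-invariant.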

And?


This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.