What?

Lifelong learning in which new elements are added to the action set of an MDP over time, assuming there is some underlying relationship between the elements of the action space.

Why?

The lifelong learning setting is practically important, and adding new actions to the action set is a reasonable requirement.

How?

[Figures from the original paper.]

So, we are doing RL with function approximation, and adding new actions is quite challenging because it requires changing the output dimension of the neural network (or of a linear model, if you are into that kind of thing).
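To make the pain concrete, here is a minimal PyTorch sketch (my own illustration, not from the paper; `expand_output_layer` is a made-up helper) of the naive fix: every time a new action arrives, the output head gets surgically enlarged and the new dimension starts from a random initialisation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not the paper's method): the naive way to handle a new
# action with a standard Q-network or policy head is to grow the output layer,
# keeping the old weights and randomly initialising the new ones.
def expand_output_layer(old_head: nn.Linear, n_new_actions: int = 1) -> nn.Linear:
    new_head = nn.Linear(old_head.in_features,
                         old_head.out_features + n_new_actions)
    with torch.no_grad():
        # Preserve what was learned for the existing actions.
        new_head.weight[:old_head.out_features] = old_head.weight
        new_head.bias[:old_head.out_features] = old_head.bias
        # The rows for the new actions keep their default random init,
        # so their value/probability estimates are essentially arbitrary at first.
    return new_head

# usage: q_net.head = expand_output_layer(q_net.head) whenever an action is added
```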

The paper introduces a formalism that is useful for the proofs but does not really help to grasp the general picture, so I'll try to explain it in words. With any new episode, the agent's action set can change. We want to adapt to that change and learn faster than we would by learning from scratch or by simply adding new output dimensions.

First, before even thinking about learning, the authors show a nice theoretical result: in an L-MDP (a lifelong MDP in which each new action set subsumes the previous one), the optimal value function of the current MDP converges, in the limit, to the optimal value function of the MDP with all the actions. So, the more actions we add, the closer the optimal value function of the current MDP is to the optimal value function of the hypothetical MDP (I will call it the mega-MDP hereafter) that has all the actions.
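In symbols (my loose restatement of the claim above, not the paper's exact theorem, and glossing over its assumptions and rates): writing $v^*_{\mathcal{M}_k}$ for the optimal value function after the $k$-th action-set change and $v^*_{\mathcal{M}_\infty}$ for the optimal value function of the mega-MDP,

$$\sup_{s} \big| v^*_{\mathcal{M}_\infty}(s) - v^*_{\mathcal{M}_k}(s) \big| \;\longrightarrow\; 0 \quad \text{as } k \to \infty.$$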

What do we do when we add learning on top? We split the policy parameterisation into two parts: an internal policy that, given the state, outputs a point in a latent action-representation space, and a mapping from that latent space to the actions that are currently available.

The main idea of LAICA (Lifelong Adaptation And Improvement for Changing Actions): whenever the action set changes, only the latent-to-action mapping is adapted (using supervised learning on observed transitions), while the internal policy keeps what it has already learned and is then improved further with ordinary policy-gradient updates.
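To show where the split lives, here is a minimal, self-contained sketch (my own simplification with made-up class names, not the authors' code): the internal policy outputs a latent point, and a separate module turns that point into one of the currently available actions, here via nearest neighbour over per-action embeddings.

```python
import torch
import torch.nn as nn

class InternalPolicy(nn.Module):
    """pi(e | s): maps a state to a point in a latent action-representation space.
    This part is kept when the action set changes and is only improved further."""
    def __init__(self, state_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class LatentToAction:
    """Maps a latent point to one of the currently available actions.
    Nearest neighbour over per-action embeddings is a big simplification of the
    learned mapping in the paper; only this part has to be adapted when new
    actions show up."""
    def __init__(self, latent_dim: int):
        self.embeddings = torch.empty(0, latent_dim)  # one row per known action

    def add_actions(self, new_embeddings: torch.Tensor) -> None:
        # Adaptation step: register representations for the newly added actions.
        self.embeddings = torch.cat([self.embeddings, new_embeddings], dim=0)

    def __call__(self, latent: torch.Tensor) -> int:
        # Pick the available action whose embedding is closest to the latent point.
        distances = torch.cdist(latent.unsqueeze(0), self.embeddings)
        return int(distances.argmin())


# Acting: a = latent_to_action(internal_policy(s)).
# When the action set grows, only latent_to_action.add_actions(...) (and the
# representation learning behind it) changes; internal_policy keeps its weights.
```

In the paper the action representations and the mapping are learned from data rather than given; the nearest-neighbour lookup above is only there to make the two-part structure concrete.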

And?


This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.