What?
Lifelong learning where new elements are added to the action set of an MDP, assuming there is some relationship between the elements of the action space.
Why?
The lifelong learning setting is practically important, and adding new actions to the action set is a reasonable requirement.
How?
[Figure from the original paper.]
So, we are in the RL-with-function-approximation setting, and adding new actions is quite challenging because it requires changing the output dimension of the neural network (or of a linear model, if you are into that kind of thing).
The paper introduces a formalism that is useful for the proofs but does not help much in grasping the general picture, so I'll try to explain it in words. With any new episode, the agent's action set can change. We want to adapt to that change and learn faster than learning from scratch or naively adding new output dimensions would allow.
First, before even thinking about learning, the authors show a nice theoretical result: in an L-MDP (where the action set of each new MDP subsumes the action set of the previous one), the optimal value function of the new MDP converges, in the limit, to the optimal value function of the MDP with all the actions. So, the more actions we add, the closer the optimal value function of the current MDP is to that of the hypothetical MDP with all the actions (I will call it the mega-MDP hereafter).
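In symbols (my own shorthand, not the paper's exact theorem statement): writing $v^*_k$ for the optimal value function after the $k$-th change of the action set and $v^*$ for the optimal value function of the mega-MDP, the claim above amounts to

$$\sup_{s\in\mathcal{S}} \bigl| v^*(s) - v^*_k(s) \bigr| \rightarrow 0 \quad \text{as the action set approaches the full one.}$$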
What do we do when we add learning on top? We reparameterise the policy into two components (a code sketch of the whole construction follows right after this list):
- $\beta: \mathcal{S}\times \mathcal{\hat{E}} \rightarrow [0,1]$ (an internal policy that gives a distribution over an action-embedding space $\mathcal{\hat{E}}$)
- $\hat{\phi}: \mathcal{\hat{E}}\times \mathcal{A} \rightarrow [0,1]$ (a mapping that turns an embedding into one of the currently available actions)
- Let's do some self-supervision on top:
- Learn inverse dynamics $\varphi: \mathcal{S}\times\mathcal{S}\rightarrow \mathcal{\hat{E}}$.
- Use $\hat{\phi}$ to go from $e\in\mathcal{\hat{E}}$ to $a \in \mathcal{A}$.
- Train $\hat{\phi}$ and $\varphi$ jointly to predict the action that caused the $s \rightarrow s'$ transition.
- The authors show a bound on the difference between the value function of the current MDP and that of the mega-MDP; it depends on the supremum of the KL divergence between the transition probabilities under the actual action and under the action we get by composing $\varphi$ and $\hat{\phi}$.
- In practice it is hard to go over all states and actions, so the authors propose to minimise the average over states and actions from the observed transitions instead of the supremum.
- We don't need the reward here: the self-supervision works with the transition data alone.
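Here is how I imagine implementing the decomposition and the self-supervised objective. This is a minimal PyTorch sketch, assuming discrete actions and a continuous embedding space; the layer sizes, the Gaussian choice for $\beta$, and the cross-entropy loss are my assumptions, not the authors' code.

```python
# Minimal sketch of the policy decomposition and the self-supervised objective
# (my own guess at an implementation, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, EMB_DIM, N_ACTIONS = 8, 4, 10  # toy sizes, chosen arbitrarily

class Beta(nn.Module):
    """beta: state -> distribution over the embedding space E_hat.
    A Gaussian over a continuous embedding is one common choice;
    it gets trained later, during policy improvement."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(STATE_DIM, 2 * EMB_DIM)
    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp())

beta = Beta()

# phi_hat: embedding -> logits over the available actions
phi_hat = nn.Linear(EMB_DIM, N_ACTIONS)

# varphi: inverse dynamics, (s, s') -> predicted embedding of the action taken
varphi = nn.Sequential(nn.Linear(2 * STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, EMB_DIM))

opt = torch.optim.Adam(list(phi_hat.parameters()) + list(varphi.parameters()),
                       lr=1e-3)

def self_supervised_step(s, s_next, a):
    """One gradient step on phi_hat and varphi: predict the action that caused
    the s -> s' transition. Only (s, a, s') tuples are needed, no rewards."""
    e_hat = varphi(torch.cat([s, s_next], dim=-1))  # inferred action embedding
    logits = phi_hat(e_hat)                         # decode embedding to action
    loss = F.cross_entropy(logits, a)               # supervised by the taken action
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

With batched tensors `s`, `s_next` of shape `[B, STATE_DIM]` and integer actions `a` of shape `[B]`, calling `self_supervised_step(s, s_next, a)` performs one update.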
The main idea of LAICA (Lifelong Adaptation And Improvement for Changing Actions):
- Split learning into two phases:
- Adaptation:
- Sample random rollouts and store them in a buffer.
- Update $\hat{\phi}$ and $\varphi$ using the data from above.
- Policy Improvement:
- Use your favourite algorithm (actor-critic), but with $\beta$ and $\hat{\phi}$ instead of a monolithic policy (see the sketch below).
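Putting the two phases together, a schematic loop might look like the following. It reuses `STATE_DIM`, `phi_hat`, `beta`, and `self_supervised_step` from the previous sketch; `DummyEnv` is a made-up stand-in environment, the phase-two update is left as a comment, and for simplicity `phi_hat`'s output head already covers the largest action set. None of this is the paper's experimental setup.

```python
# Schematic two-phase LAICA loop (my reading of the algorithm; DummyEnv and
# the phase-2 update are placeholders, not the authors' setup).
import random
import torch

class DummyEnv:
    """Stand-in environment whose action set grows between changes."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
    def reset(self):
        return torch.randn(STATE_DIM)
    def step(self, a):
        # returns (next_state, reward, done); dynamics are random here
        return torch.randn(STATE_DIM), random.random(), random.random() < 0.1

def lifelong_loop(action_set_sizes, n_adapt_steps=500):
    for n_actions in action_set_sizes:           # a new, larger action set arrives
        env = DummyEnv(n_actions)

        # --- Phase 1: adaptation (rewards are ignored) ---
        buffer, s = [], env.reset()
        for _ in range(n_adapt_steps):
            a = random.randrange(env.n_actions)   # random rollouts
            s_next, _, done = env.step(a)
            buffer.append((s, a, s_next))
            s = env.reset() if done else s_next
        for s_b, a_b, sn_b in buffer:             # update phi_hat and varphi
            self_supervised_step(s_b.unsqueeze(0), sn_b.unsqueeze(0),
                                 torch.tensor([a_b]))

        # --- Phase 2: policy improvement ---
        # Run your favourite actor-critic here, sampling e ~ beta(s) and then
        # a ~ phi_hat(e), restricted to the currently available actions.
        ...

lifelong_loop(action_set_sizes=[4, 7, 10])
```

The point of keeping phase 1 reward-free is that the agent can exploit raw interaction data immediately after new actions appear, before any reward signal has been collected for them.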
And?
- I think the paper is great! I found a few drawbacks:
- The paper overuses the phrase 'structure of the action set'. It is very vague, and it is unclear what it means, at least before page 4. I believe the authors wanted to say that there is some similarity between the actions and that we can define a distance between two elements of the action space once they are embedded in some Euclidean space.
- The paper mentions Wolpertinger but does not compare to it. While it was not designed specifically for lifelong learning, I think it should have been included as a baseline.
- The L-MDP setting assumes that transition probabilities are $\rho$-Lipschitz in the actions: $\forall s, s', e_i, e_j\;\; \lVert p(s'\mid s, e_i) - p(s'\mid s, e_j)\rVert_1 \leq \rho\, \lVert e_i-e_j \rVert_1$. This is where the 'structure' becomes more or less clear: we want a metric on the action space. I wish the authors had provided more intuition or discussion about whether this holds (at least approximately) in practical settings (a toy numerical check is sketched below).
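To make the assumption concrete, here is the kind of toy numerical check I would run: estimate the worst observed ratio between the two sides of the inequality on a synthetic transition model. The softmax model below is made up purely for illustration and has nothing to do with the paper's experiments.

```python
# Toy empirical estimate of the Lipschitz constant rho on a synthetic model
# (the softmax transition model is an arbitrary illustration, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
N_STATES, EMB_DIM = 20, 4
W = rng.normal(size=(N_STATES, N_STATES, EMB_DIM))  # per-state weights

def p_next(s, e):
    """Toy p(. | s, e): softmax over next states, linear in the embedding e."""
    logits = W[s] @ e
    z = np.exp(logits - logits.max())
    return z / z.sum()

def empirical_rho(n_pairs=10_000):
    """Largest observed ratio ||p(.|s,e_i) - p(.|s,e_j)||_1 / ||e_i - e_j||_1."""
    worst = 0.0
    for _ in range(n_pairs):
        s = rng.integers(N_STATES)
        e_i, e_j = rng.normal(size=(2, EMB_DIM))
        num = np.abs(p_next(s, e_i) - p_next(s, e_j)).sum()
        den = np.abs(e_i - e_j).sum()
        worst = max(worst, num / den)
    return worst

print("empirical rho estimate:", empirical_rho())
```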