What?

A value function parameterisation that takes a goal as input in addition to the state, i.e. the separate $V^\pi_{g_1}(s;\theta), \dots, V^\pi_{g_n}(s;\theta)$ become a single $V^\pi(s,g;\theta)$.

Why?

Value functions are a big deal in RL. If a single value function could generalise across tasks (e.g. across different goals), we could reuse what was learnt for one goal when pursuing another, instead of starting from scratch every time.

How?

The paper deals with goal-directed environments, which can be represented as a set of tasks that share the same transition dynamics but differ in their reward functions. Each goal $g$ gets its own pseudo-reward function $R_g(s,a,s')$.

They define the value function as

$$ V_{g, \pi}(s) \coloneqq \mathbb{E}\big[\sum_{t=0}^\infty R_g(s_{t+1}, a_t, s_t)\prod_{k=0}^t \gamma_g(s_k)\mid s_0=s\big], $$

where $\gamma_g(s)$ is a state- and goal-dependent discounting function which also accounts for soft termination at the goal ($\gamma_g(s)=0$ iff $s=g$).
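To unpack the definition, here is a minimal sketch of estimating this return from a single sampled trajectory by Monte Carlo; the function and argument names are mine, not the paper's:

```python
def goal_return(states, actions, reward_fn, discount_fn):
    """Monte-Carlo return for one trajectory under the definition above.

    states      -- [s_0, s_1, ..., s_T]
    actions     -- [a_0, a_1, ..., a_{T-1}]
    reward_fn   -- pseudo-reward R_g(s_next, a, s) for a fixed goal g
    discount_fn -- goal-dependent discount gamma_g(s), equal to 0 at the goal
    """
    total, running_discount = 0.0, 1.0
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        running_discount *= discount_fn(s)      # prod_{k=0}^{t} gamma_g(s_k)
        total += reward_fn(s_next, a, s) * running_discount
        if running_discount == 0.0:             # gamma_g(g) = 0: nothing left to add
            break
    return total
```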

The main goal of the paper is to learn a single function approximating a whole set of goal-specific value functions, i.e. to turn $V_{g_1,\pi}(s), \dots, V_{g_n,\pi}(s)$ into $V_\pi(s, g)$.

An obvious thing to do here is to concatenate the two vectors $s$ and $g$ and feed them into a single network. However, we can do better: the authors come up with the nice idea of a two-stream architecture that embeds the state and the goal separately and combines the two embeddings, which amounts to a low-rank approximation of the table of values over (state, goal) pairs.
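A minimal sketch of what such a two-stream parameterisation could look like, assuming PyTorch, a dot product as the combination function $h$, and made-up layer sizes:

```python
import torch
import torch.nn as nn

class UVFA(nn.Module):
    """Two-stream universal value function: V(s, g) ~ h(phi(s), psi(g)).

    With h chosen as a dot product, the table of values over (state, goal)
    pairs is approximated by a rank-`embed_dim` factorisation.
    """
    def __init__(self, state_dim, goal_dim, embed_dim=16):
        super().__init__()
        self.phi = nn.Sequential(   # state embedding network
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
        self.psi = nn.Sequential(   # goal embedding network
            nn.Linear(goal_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

    def forward(self, s, g):
        return (self.phi(s) * self.psi(g)).sum(dim=-1)  # dot product h


uvfa = UVFA(state_dim=8, goal_dim=8)
values = uvfa(torch.randn(32, 8), torch.randn(32, 8))   # batch of V(s, g) estimates
```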

In the supervised setting, where a table of values for (state, goal) pairs is available, the value function is trained in a two-stage regime:

1. Factorise the matrix of values (rows indexed by states, columns by goals) into low-rank factors, which yields a target embedding $\hat\phi_s$ for every state and $\hat\psi_g$ for every goal.
2. Train the two embedding networks $\phi(s;\theta)$ and $\psi(g;\theta)$ by regressing them towards these target embeddings.

In the full RL setting we do not have those values and have to be creative. The authors propose two ways of getting them:

1. Horde: learn a separate value function per training goal off-policy from a shared stream of experience, and use their estimates to fill the value matrix for the two-stage procedure above (Algorithm 1 in the paper).
2. Direct bootstrapping: train the UVFA end-to-end with Q-learning-style bootstrapped targets, where the UVFA itself supplies the bootstrap value (Algorithm 2 in the paper).

I'll write down pseudocode for the HORDE version (Algorithm 1 in the paper) below.
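What follows is a rough Python sketch rather than a faithful transcription: the toy random-walk environment, the tabular per-goal value estimates and the SVD-based factorisation are my own simplifications (the paper learns the per-goal values with Horde demons over features, uses a dedicated low-rank factorisation routine, and trains embedding networks rather than keeping lookup tables).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular setting so everything stays runnable: states are 0..19 on a
# ring, the behaviour policy is a random walk, goals are individual states.
n_states, n_goals, rank, gamma, alpha = 20, 5, 3, 0.9, 0.1
goals = rng.choice(n_states, size=n_goals, replace=False)

def pseudo_reward(s_next, g):        # R_g: in this toy it only depends on s_{t+1}
    return 1.0 if s_next == g else 0.0

def discount(s, g):                  # gamma_g(s): 0 at the goal (soft termination)
    return 0.0 if s == g else gamma

# Stage 0 (Horde): one value estimate per training goal, all of them
# learning off-policy from the same stream of transitions.
V = np.zeros((n_states, n_goals))    # V[s, i] approximates V_{g_i, pi}(s)
s = 0
for _ in range(50_000):
    s_next = (s + rng.choice([-1, 1])) % n_states
    for i, g in enumerate(goals):
        # TD target consistent with the return definition above:
        # V_g(s) = gamma_g(s) * E[R_g(s', a, s) + V_g(s')]
        target = discount(s, g) * (pseudo_reward(s_next, g) + V[s_next, i])
        V[s, i] += alpha * (target - V[s, i])
    s = s_next

# Stage 1: rank-n factorisation of the value matrix into target embeddings.
U, sing, Wt = np.linalg.svd(V, full_matrices=False)
phi_targets = U[:, :rank] * sing[:rank]    # one target embedding per state
psi_targets = Wt[:rank, :].T               # one target embedding per goal

# Stage 2: in the paper, embedding networks phi(s) and psi(g) are trained by
# regression towards these targets; in this tabular toy we use them directly.
V_hat = phi_targets @ psi_targets.T
print("max reconstruction error:", np.abs(V_hat - V).max())
```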