What?

A value function parameterisation that takes a goal as input in addition to the state, i.e. the separate $V^\pi_{g_1}(s;\theta), \dots, V^\pi_{g_n}(s;\theta)$ become a single $V^\pi(s,g;\theta)$.

Why?

Value functions are a big deal in RL. If a single value function could generalise across tasks (e.g. across different goals), we could reuse what was learnt for one goal when pursuing another, instead of starting from scratch every time.

How?

The paper deals with goal-directed environments, which can be represented as a set of tasks that share the same transition dynamics but differ in their reward functions. Each goal $g$ gets its own pseudo-reward function $R_g(s,a,s')$.

They define the value function as

$$ V_{g, \pi}(s) \coloneqq \mathbb{E}\big[\sum_{t=0}^\infty R_g(s_{t+1}, a_t, s_t)\prod_{k=0}^t \gamma_g(s_k)\mid s_0=s\big], $$

where $\gamma_g(s)$ is a state- and goal-dependent discounting function which also accounts for soft termination at the goal ($\gamma_g(s)=0$ iff $s=g$).
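To unpack the definition, here is a minimal sketch of estimating this return from a single sampled trajectory by Monte Carlo; the function and argument names are mine, not the paper's:

```python
def goal_return(states, actions, reward_fn, discount_fn):
    """Monte-Carlo return for one trajectory under the definition above.

    states      -- [s_0, s_1, ..., s_T]
    actions     -- [a_0, a_1, ..., a_{T-1}]
    reward_fn   -- pseudo-reward R_g(s_next, a, s) for a fixed goal g
    discount_fn -- goal-dependent discount gamma_g(s), equal to 0 at the goal
    """
    total, running_discount = 0.0, 1.0
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        running_discount *= discount_fn(s)      # prod_{k=0}^{t} gamma_g(s_k)
        total += reward_fn(s_next, a, s) * running_discount
        if running_discount == 0.0:             # gamma_g(g) = 0: nothing left to add
            break
    return total
```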

The main goal of the paper is to learn a single function approximating a whole set of goal-specific value functions, i.e. to turn $V_{g_1,\pi}(s), \dots, V_{g_n,\pi}(s)$ into $V_\pi(s, g)$.

An obvious thing to do here is to concatenate the two vectors $s$ and $g$ and feed them into a single network. However, we can do better: the authors come up with the nice idea of a two-stream architecture that embeds the state and the goal separately and combines the two embeddings, which amounts to a low-rank approximation of the table of values over (state, goal) pairs.
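A minimal sketch of what such a two-stream parameterisation could look like, assuming PyTorch, a dot product as the combination function $h$, and made-up layer sizes:

```python
import torch
import torch.nn as nn

class UVFA(nn.Module):
    """Two-stream universal value function: V(s, g) ~ h(phi(s), psi(g)).

    With h chosen as a dot product, the table of values over (state, goal)
    pairs is approximated by a rank-`embed_dim` factorisation.
    """
    def __init__(self, state_dim, goal_dim, embed_dim=16):
        super().__init__()
        self.phi = nn.Sequential(   # state embedding network
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
        self.psi = nn.Sequential(   # goal embedding network
            nn.Linear(goal_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

    def forward(self, s, g):
        return (self.phi(s) * self.psi(g)).sum(dim=-1)  # dot product h


uvfa = UVFA(state_dim=8, goal_dim=8)
values = uvfa(torch.randn(32, 8), torch.randn(32, 8))   # batch of V(s, g) estimates
```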

In the supervised setting, where a table of values for (state, goal) pairs is available, the value function is trained in a two-stage regime:

1. Factorise the matrix of values (rows indexed by states, columns by goals) into low-rank factors, which yields a target embedding $\hat\phi_s$ for every state and $\hat\psi_g$ for every goal.
2. Train the two embedding networks $\phi(s;\theta)$ and $\psi(g;\theta)$ by regressing them towards these target embeddings.

In the full RL setting we do not have those values and have to be creative. The authors propose two ways of getting them:

1. Horde: learn a separate value function per training goal off-policy from a shared stream of experience, and use their estimates to fill the value matrix for the two-stage procedure above (Algorithm 1 in the paper).
2. Direct bootstrapping: train the UVFA end-to-end with Q-learning-style bootstrapped targets, where the UVFA itself supplies the bootstrap value (Algorithm 2 in the paper).

I'll write down pseudocode for the HORDE version (Algorithm 1 in the paper) below.
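What follows is a rough Python sketch rather than a faithful transcription: the toy random-walk environment, the tabular per-goal value estimates and the SVD-based factorisation are my own simplifications (the paper learns the per-goal values with Horde demons over features, uses a dedicated low-rank factorisation routine, and trains embedding networks rather than keeping lookup tables).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular setting so everything stays runnable: states are 0..19 on a
# ring, the behaviour policy is a random walk, goals are individual states.
n_states, n_goals, rank, gamma, alpha = 20, 5, 3, 0.9, 0.1
goals = rng.choice(n_states, size=n_goals, replace=False)

def pseudo_reward(s_next, g):        # R_g: in this toy it only depends on s_{t+1}
    return 1.0 if s_next == g else 0.0

def discount(s, g):                  # gamma_g(s): 0 at the goal (soft termination)
    return 0.0 if s == g else gamma

# Stage 0 (Horde): one value estimate per training goal, all of them
# learning off-policy from the same stream of transitions.
V = np.zeros((n_states, n_goals))    # V[s, i] approximates V_{g_i, pi}(s)
s = 0
for _ in range(50_000):
    s_next = (s + rng.choice([-1, 1])) % n_states
    for i, g in enumerate(goals):
        # TD target consistent with the return definition above:
        # V_g(s) = gamma_g(s) * E[R_g(s', a, s) + V_g(s')]
        target = discount(s, g) * (pseudo_reward(s_next, g) + V[s_next, i])
        V[s, i] += alpha * (target - V[s, i])
    s = s_next

# Stage 1: rank-n factorisation of the value matrix into target embeddings.
U, sing, Wt = np.linalg.svd(V, full_matrices=False)
phi_targets = U[:, :rank] * sing[:rank]    # one target embedding per state
psi_targets = Wt[:rank, :].T               # one target embedding per goal

# Stage 2: in the paper, embedding networks phi(s) and psi(g) are trained by
# regression towards these targets; in this tabular toy we use them directly.
V_hat = phi_targets @ psi_targets.T
print("max reconstruction error:", np.abs(V_hat - V).max())
```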