Implicit (model-free) planning is a developing area. Value Iteration Networks (VINs) express value iteration as a convolution of state values with transition probabilities followed by max-pooling over actions, which works when the MDP is a grid world. Generalised VINs drop the grid-world assumption.
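To make the convolution + max-pooling view concrete, here's a tiny NumPy sketch of one value-iteration sweep on a grid world (my own illustration, not code from the VIN paper; the 3x3 transition kernel `P_kernel` and the tensor shapes are assumptions for the example): summing discounted neighbour values under the local transition kernel is a convolution, and picking the best action is a max over the action channel.

```python
import numpy as np

def vi_step_as_conv(V, R, P_kernel, gamma=0.9):
    """One value-iteration sweep on an H x W grid world.

    V        -- (H, W) current state values
    R        -- (A, H, W) reward per action and cell
    P_kernel -- (A, 3, 3) local transition probabilities per action
    Returns the updated (H, W) value map: max over actions of
    R + gamma * (P_kernel convolved with V), i.e. convolution + max-pooling.
    """
    A, H, W = R.shape
    V_pad = np.pad(V, 1)                               # zero-pad the border
    Q = np.empty((A, H, W))
    for a in range(A):
        for i in range(H):
            for j in range(W):
                neighbourhood = V_pad[i:i + 3, j:j + 3]
                Q[a, i, j] = R[a, i, j] + gamma * np.sum(P_kernel[a] * neighbourhood)
    return Q.max(axis=0)                               # "max-pooling" over the action channel
```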
Three main problems here:
TL;DR, source: https://arxiv.org/abs/2010.13146
TransE is an important component of XLVIN. I haven't read the original TransE paper, so I describe it as it's presented in the XLVIN paper. First, the state is encoded into a latent space: $z: S\rightarrow\mathbf{R}^k$. Then, to estimate the effect of taking an action, we add a translation vector to the latent: $h_{s,a} = h_s + T(h_s, a)$, where $h_s = z(s)$. The intuition is: "what is the latent representation of the state we end up in after taking action $a$ in the state whose latent representation is $h_s$?" To learn all of this, we use a triplet loss:
$$ \mathcal{L} = d\big(z(s) + T(z(s),a),\, z(s')\big) + \max\{0,\ \xi - d(z(\tilde{s}), z(s'))\}, $$
where $\tilde{s}$ is a negatively sampled state and $d$ is a distance in the latent space.
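Here is a minimal PyTorch sketch of how I read this part (my own interpretation, not the authors' code; the MLP sizes, the squared Euclidean distance and the one-hot action encoding are assumptions): the encoder `z` and the translation model `T` are small networks trained with the triplet loss above on `(s, a, s')` transitions plus a negative state `s_neg`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransE(nn.Module):
    """Encoder z: S -> R^k and translation model T(h_s, a), as described above."""
    def __init__(self, obs_dim, num_actions, k=64, xi=1.0):
        super().__init__()
        self.z = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))
        # T takes the state embedding concatenated with a one-hot action
        self.T = nn.Sequential(nn.Linear(k + num_actions, 128), nn.ReLU(), nn.Linear(128, k))
        self.num_actions = num_actions
        self.xi = xi  # margin used in the hinge term of the loss

    def next_embedding(self, h_s, a):
        # a: LongTensor of action indices
        a_onehot = F.one_hot(a, self.num_actions).float()
        return h_s + self.T(torch.cat([h_s, a_onehot], dim=-1))   # h_s + T(h_s, a)

    def loss(self, s, a, s_next, s_neg):
        h_s, h_next, h_neg = self.z(s), self.z(s_next), self.z(s_neg)
        pred = self.next_embedding(h_s, a)
        d_pos = (pred - h_next).pow(2).sum(-1)                     # d(z(s) + T(z(s), a), z(s'))
        d_neg = (h_neg - h_next).pow(2).sum(-1)                    # d(z(s~), z(s'))
        return (d_pos + torch.clamp(self.xi - d_neg, min=0.0)).mean()
```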
This paper continues the work on Neural Execution of Graph Algorithms, which learns to run classical CS algorithms by supervising the model on intermediate results to improve transfer properties. Value iteration is a perfect algorithm to combine with neural execution because of its dynamic programming nature (check this paper to learn more about algorithmic alignment).
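To illustrate the "supervise on intermediate results" idea, here is a rough sketch under my own assumptions (not the paper's training code): on a small tabular MDP you can record every intermediate value function produced by value iteration and use the k-th iterate as the target for the executor's k-th message-passing step.

```python
import numpy as np

def vi_targets(P, R, gamma=0.95, num_steps=10):
    """Run value iteration on a tabular MDP and keep every intermediate V.

    P -- (A, S, S) transition probabilities, R -- (S, A) rewards.
    Returns [V_0, V_1, ..., V_K], usable as step-wise supervision for a
    GNN executor meant to mimic one VI update per message-passing step.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    targets = [V.copy()]
    for _ in range(num_steps):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V = Q.max(axis=1)                              # Bellman optimality backup
        targets.append(V.copy())
    return targets
```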
The main idea of the paper is in the picture above. My hand-wavy Python pseudo-science-code is below 🐍:
def xlvin(state, z, gnn, pi_net, vi_net):    # returns (policy, value) for the current state
    h_s = z(state)                           # embed the state
    S = [set() for _ in range(K)]            # init embedding sets, K is the expansion depth
    S[0].add(h_s)                            # the depth-0 set holds the current state embedding
    E = set()                                # init edge set
    Nb = {}                                  # init neighbourhood map
    for depth in range(1, K):
        for h in S[depth - 1]:
            for a in action_set:
                h_next = h + T(h, a)         # expected embedding of the neighbouring state
                S[depth].add(h_next)         # add the embedding to the graph
                E.add((h, h_next, a))        # update the edge set
        for h in S[depth - 1]:               # define the neighbourhood of h
            Nb[h] = [h_to for (h_from, h_to, a) in E if h_from == h]
    # run the execution model on the graph (S, E) for K iterations
    # and take the output features of the node that stores the embedding of s
    h_out_emb = gnn(S, E, Nb).get_node_features(h_s)
    return pi_net(h_s, h_out_emb), vi_net(h_s, h_out_emb)
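For completeness, here's a guess at the general shape of one executor step over that latent graph: a standard message-passing update with max aggregation over each node's expanded neighbours, echoing the Bellman max. This is not the paper's exact architecture, just a sketch; the `message`/`update` MLPs and the dict-based graph representation are my own choices.

```python
import torch
import torch.nn as nn

class ExecutorStep(nn.Module):
    """One message-passing step over the latent graph (S, E, Nb) built by xlvin()."""
    def __init__(self, k, hidden=128):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * k, hidden), nn.ReLU(), nn.Linear(hidden, k))
        self.update = nn.Sequential(nn.Linear(2 * k, hidden), nn.ReLU(), nn.Linear(hidden, k))

    def forward(self, h, Nb):
        # h: dict node_id -> (k,) embedding; Nb: dict node_id -> list of neighbour node ids
        new_h = {}
        for v, h_v in h.items():
            if Nb.get(v):
                msgs = torch.stack([self.message(torch.cat([h[u], h_v], dim=-1)) for u in Nb[v]])
                agg = msgs.max(dim=0).values           # max-aggregation, echoing the Bellman max
            else:
                agg = torch.zeros_like(h_v)            # leaves of the expansion receive no messages
            new_h[v] = self.update(torch.cat([h_v, agg], dim=-1))
        return new_h
```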
All this beauty is trained with PPO, with $T$ pretrained on samples from a random policy and the executor pretrained on a dataset of MDPs (or some random graphs, e.g. Erdős–Rényi or Barabási–Albert). Everything is validated on a known MDP (8x8 grid worlds), continuous-space observations (classic control RL envs) and pixel-based envs (Atari).
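A hedged sketch of what the synthetic pretraining MDPs could look like with networkx; the exact construction on top of the random graph is my guess, not taken from the paper. The output can be fed into `vi_targets()` from the earlier sketch to get step-wise supervision for the executor.

```python
import networkx as nx
import numpy as np

def random_graph_mdp(n=20, num_actions=4, p=0.2, seed=0, kind='erdos_renyi'):
    """A toy MDP built on a random graph, usable as pretraining data for the executor.

    Transitions: each action deterministically moves to one randomly chosen graph
    neighbour (or stays put if the node is isolated). Rewards are random.
    Illustrative only; the paper's exact recipe may differ.
    """
    rng = np.random.default_rng(seed)
    if kind == 'erdos_renyi':
        g = nx.erdos_renyi_graph(n, p, seed=seed)
    else:
        g = nx.barabasi_albert_graph(n, m=2, seed=seed)
    P = np.zeros((num_actions, n, n))
    for s in range(n):
        nbrs = list(g.neighbors(s)) or [s]
        for a in range(num_actions):
            P[a, s, rng.choice(nbrs)] = 1.0    # deterministic transition per (s, a)
    R = rng.normal(size=(n, num_actions))
    return P, R                                # feed into vi_targets() to get supervision
```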
I like how clearly written the paper is. At a high level, it's extremely easy to grasp, especially if you know a bit about GNNs and RL.
I'm still not sure about the motivation behind using a neural network (including GNNs) to mimic a well-established CS algorithm. Yes, it would be cool to initialise a GNN with Dijkstra, train it a bit and get a better Dijkstra, but I doubt this is possible. Algorithms are so discrete that it's hard to search in that space (whatever that means). If you introduce a tiny bug into your implementation of a CS algorithm, everything goes to hell; it's very hard to get a better algorithm by making a bug while implementing the old one.
<aside> 💡 It's not really about improving Dijkstra. It's more about making classical algorithms applicable even if you don't satisfy their preconditions.
</aside>
Related to the previous point, I'd like the conclusion section to say something about future work, to place this paper in a more general context.
The authors show that you can reuse an executor trained on one task by applying it to another; however, Table 1 (below) has an interesting result. XLVIN-CP (executor pretrained on Cartpole) is much better on Acrobot than the randomly initialised one (XLVIN-R). But XLVIN-R is not that different from XLVIN-CP on Cartpole itself. This is really weird and I'm not sure what it means from the transfer perspective: it's odd that a biased network helps more on a new task than on the task it was trained on. According to Petar, this is probably related to Cartpole training being stopped as soon as the 195+ reward threshold is hit.