Implicit (model-free) planning is a developing area. Value Iteration Networks (VINs) express value iteration as a convolution of state values with transition probabilities followed by max-pooling over actions, which works when the MDP is a grid world. Generalised VINs drop the grid-world assumption.
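To make the convolution + max-pooling view concrete, here's a tiny NumPy sketch of one value-iteration sweep on a grid world (my own illustration, not code from the VIN paper; the 3x3 transition kernel `P_kernel` and the tensor shapes are assumptions for the example): summing discounted neighbour values under the local transition kernel is a convolution, and picking the best action is a max over the action channel.

```python
import numpy as np

def vi_step_as_conv(V, R, P_kernel, gamma=0.9):
    """One value-iteration sweep on an H x W grid world.

    V        -- (H, W) current state values
    R        -- (A, H, W) reward per action and cell
    P_kernel -- (A, 3, 3) local transition probabilities per action
    Returns the updated (H, W) value map: max over actions of
    R + gamma * (P_kernel convolved with V), i.e. convolution + max-pooling.
    """
    A, H, W = R.shape
    V_pad = np.pad(V, 1)                               # zero-pad the border
    Q = np.empty((A, H, W))
    for a in range(A):
        for i in range(H):
            for j in range(W):
                neighbourhood = V_pad[i:i + 3, j:j + 3]
                Q[a, i, j] = R[a, i, j] + gamma * np.sum(P_kernel[a] * neighbourhood)
    return Q.max(axis=0)                               # "max-pooling" over the action channel
```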
Three main problems here:
TL;DR, source: https://arxiv.org/abs/2010.13146
TransE is an important component of XLVIN. I haven't read the original TransE paper, so I describe it as it's presented in the XLVIN paper. First, the state is encoded into a latent space: $z: S\rightarrow\mathbf{R}^k$. Then, to estimate the effect of taking an action, we add a translation vector to the latent: $h_{s,a} = h_s + T(h_s, a)$, where $h_s = z(s)$. The intuition is: "what is the latent representation of the state we end up in after taking action $a$ in the state whose latent representation is $h_s$?" To learn all of this, we use a triplet loss:
$$ \mathcal{L} = d\big(z(s) + T(z(s),a),\, z(s')\big) + \max\{0,\ \xi - d(z(\tilde{s}), z(s'))\}, $$
where $\tilde{s}$ is a negatively sampled state and $d$ is a distance in the latent space.
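Here is a minimal PyTorch sketch of how I read this part (my own interpretation, not the authors' code; the MLP sizes, the squared Euclidean distance and the one-hot action encoding are assumptions): the encoder `z` and the translation model `T` are small networks trained with the triplet loss above on `(s, a, s')` transitions plus a negative state `s_neg`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransE(nn.Module):
    """Encoder z: S -> R^k and translation model T(h_s, a), as described above."""
    def __init__(self, obs_dim, num_actions, k=64, xi=1.0):
        super().__init__()
        self.z = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, k))
        # T takes the state embedding concatenated with a one-hot action
        self.T = nn.Sequential(nn.Linear(k + num_actions, 128), nn.ReLU(), nn.Linear(128, k))
        self.num_actions = num_actions
        self.xi = xi  # margin used in the hinge term of the loss

    def next_embedding(self, h_s, a):
        # a: LongTensor of action indices
        a_onehot = F.one_hot(a, self.num_actions).float()
        return h_s + self.T(torch.cat([h_s, a_onehot], dim=-1))   # h_s + T(h_s, a)

    def loss(self, s, a, s_next, s_neg):
        h_s, h_next, h_neg = self.z(s), self.z(s_next), self.z(s_neg)
        pred = self.next_embedding(h_s, a)
        d_pos = (pred - h_next).pow(2).sum(-1)                     # d(z(s) + T(z(s), a), z(s'))
        d_neg = (h_neg - h_next).pow(2).sum(-1)                    # d(z(s~), z(s'))
        return (d_pos + torch.clamp(self.xi - d_neg, min=0.0)).mean()
```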
This paper continues the work on Neural Execution of Graph Algorithms, which learns to run classical CS algorithms by supervising the model on intermediate results to improve transfer properties. Value iteration is a perfect algorithm to combine with neural execution because of its dynamic programming nature (check this paper to learn more about algorithmic alignment).
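To illustrate the "supervise on intermediate results" idea, here is a rough sketch under my own assumptions (not the paper's training code): on a small tabular MDP you can record every intermediate value function produced by value iteration and use the k-th iterate as the target for the executor's k-th message-passing step.

```python
import numpy as np

def vi_targets(P, R, gamma=0.95, num_steps=10):
    """Run value iteration on a tabular MDP and keep every intermediate V.

    P -- (A, S, S) transition probabilities, R -- (S, A) rewards.
    Returns [V_0, V_1, ..., V_K], usable as step-wise supervision for a
    GNN executor meant to mimic one VI update per message-passing step.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    targets = [V.copy()]
    for _ in range(num_steps):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V = Q.max(axis=1)                              # Bellman optimality backup
        targets.append(V.copy())
    return targets
```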
The main idea of the paper is in the picture above. My hand-wavy Python pseudo-science-code is below 🐍:
def xlvin(state, z, gnn, pi_net, vi_net):    # returns (policy, value) for the current state
    h_s = z(state)                           # embed the state
    S = [set() for _ in range(K)]            # init embedding sets, K is the expansion depth
    S[0].add(h_s)                            # the depth-0 set holds the current state embedding
    E = set()                                # init edge set
    Nb = {}                                  # init neighbourhood map
    for depth in range(1, K):
        for h in S[depth - 1]:
            for a in action_set:
                h_next = h + T(h, a)         # expected embedding of the neighbouring state
                S[depth].add(h_next)         # add the embedding to the graph
                E.add((h, h_next, a))        # update the edge set
        for h in S[depth - 1]:               # define the neighbourhood of h
            Nb[h] = [h_to for (h_from, h_to, a) in E if h_from == h]
    # run the execution model on the graph (S, E) for K iterations
    # and take the output features of the node that stores the embedding of s
    h_out_emb = gnn(S, E, Nb).get_node_features(h_s)
    return pi_net(h_s, h_out_emb), vi_net(h_s, h_out_emb)
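For completeness, here's a guess at the general shape of one executor step over that latent graph: a standard message-passing update with max aggregation over each node's expanded neighbours, echoing the Bellman max. This is not the paper's exact architecture, just a sketch; the `message`/`update` MLPs and the dict-based graph representation are my own choices.

```python
import torch
import torch.nn as nn

class ExecutorStep(nn.Module):
    """One message-passing step over the latent graph (S, E, Nb) built by xlvin()."""
    def __init__(self, k, hidden=128):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * k, hidden), nn.ReLU(), nn.Linear(hidden, k))
        self.update = nn.Sequential(nn.Linear(2 * k, hidden), nn.ReLU(), nn.Linear(hidden, k))

    def forward(self, h, Nb):
        # h: dict node_id -> (k,) embedding; Nb: dict node_id -> list of neighbour node ids
        new_h = {}
        for v, h_v in h.items():
            if Nb.get(v):
                msgs = torch.stack([self.message(torch.cat([h[u], h_v], dim=-1)) for u in Nb[v]])
                agg = msgs.max(dim=0).values           # max-aggregation, echoing the Bellman max
            else:
                agg = torch.zeros_like(h_v)            # leaves of the expansion receive no messages
            new_h[v] = self.update(torch.cat([h_v, agg], dim=-1))
        return new_h
```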
All this beauty is trained with PPO, with $T$ pretrained on samples from a random policy and the executor pretrained on a dataset of MDPs (or some random graphs, e.g. Erdős–Rényi or Barabási–Albert). Everything is validated on a known MDP (8x8 grid worlds), continuous-space observations (classic control RL envs) and pixel-based envs (Atari).
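A hedged sketch of what the synthetic pretraining MDPs could look like with networkx; the exact construction on top of the random graph is my guess, not taken from the paper. The output can be fed into `vi_targets()` from the earlier sketch to get step-wise supervision for the executor.

```python
import networkx as nx
import numpy as np

def random_graph_mdp(n=20, num_actions=4, p=0.2, seed=0, kind='erdos_renyi'):
    """A toy MDP built on a random graph, usable as pretraining data for the executor.

    Transitions: each action deterministically moves to one randomly chosen graph
    neighbour (or stays put if the node is isolated). Rewards are random.
    Illustrative only; the paper's exact recipe may differ.
    """
    rng = np.random.default_rng(seed)
    if kind == 'erdos_renyi':
        g = nx.erdos_renyi_graph(n, p, seed=seed)
    else:
        g = nx.barabasi_albert_graph(n, m=2, seed=seed)
    P = np.zeros((num_actions, n, n))
    for s in range(n):
        nbrs = list(g.neighbors(s)) or [s]
        for a in range(num_actions):
            P[a, s, rng.choice(nbrs)] = 1.0    # deterministic transition per (s, a)
    R = rng.normal(size=(n, num_actions))
    return P, R                                # feed into vi_targets() to get supervision
```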
I like how clearly written the paper is. At a high level, it's extremely easy to grasp, especially if you know a bit about GNNs and RL.
I'm still not sure about the motivation behind using a neural network (including GNNs) to mimic a well-established CS algorithm. Yes, it would be cool to initialise a GNN with Dijkstra, train it a bit and get a better Dijkstra, but I doubt this is possible. Algorithms are so discrete that it's hard to search in that space (whatever that means). If you introduce a tiny bug into your implementation of a CS algorithm, everything goes to hell; it's very hard to get a better algorithm by making a bug while implementing the old one.
<aside> 💡 It's not really about improving Dijkstra. It's more about making classical algorithms applicable even if you don't satisfy their preconditions.
</aside>
Related to the previous point, I'd like the conclusion section to say something about future work, to place this paper in a more general context.
The authors show that you can reuse an executor trained on one task by applying it to another; however, Table 1 (below) has an interesting result. XLVIN-CP (executor pretrained on Cartpole) is much better on Acrobot than the randomly initialised one (XLVIN-R). But XLVIN-R is not that different from XLVIN-CP on Cartpole itself. This is really weird and I'm not sure what it means from the transfer perspective: it's odd that a biased network helps more on a new task than on the task it was trained on. According to Petar, this is probably related to Cartpole training being stopped as soon as the 195+ reward threshold is hit.