What?
Learning node embeddings from a sampled subset of each node's neighbourhood, with the ability to generalise to unseen graphs.
Why?
Existing approaches (when the paper was published) assumed a single fixed graph. We need something that generalises to new graphs.
How?
Figure: overview of the approach (source: original paper).
TLDR: Learn $K$ aggregator functions, one for each hop (search depth).
For forward propagation:
- For each depth $k \in \{1, \dots, K\}$:
    - uniformly sample a fixed-size neighbourhood of the node and aggregate it;
        - the graphs are big, so subsampling saves a lot of compute.
    - concatenate the aggregated vector with the node's features from the previous iteration. The aggregator can be one of the following (see the sketch after this list):
        - Mean: $h_v^k \leftarrow \sigma(W\cdot\texttt{MEAN}(\{h_v^{k-1}\} \cup \{h_u^{k-1}, \forall u\in \mathcal{N}(v)\}))$ (this variant folds the concatenation into the mean);
        - LSTM: not permutation invariant, but will do;
        - Pooling: $h_{\mathcal{N}(v)}^k \leftarrow \texttt{MAX}(\{\sigma(W_{\text{pool}}\cdot h^{k-1}_u+b),\forall u\in \mathcal{N}(v)\})$, where the max is taken element-wise and the result is then concatenated with $h_v^{k-1}$ as usual;
    - project with a learned weight matrix and apply a non-linearity;
    - normalise by the $L_2$ norm.
- For unsupervised learning, push nodes that co-occur on short random walks towards similar embeddings and push distant nodes apart (see the sketch after this list):
    - $J(z_u) = -\log{(\sigma(z^T_u z_v))} - Q\,\mathbb{E}_{v_n\sim P_n(v)}\log{(\sigma(-z^T_u z_{v_n}))}$, where $v$ co-occurs with $u$ on a fixed-length random walk, $P_n$ is the negative-sampling distribution and $Q$ is the number of negative samples.
- GraphSAGE reduces to the WL isomorphism test ("naive vertex refinement") if:
    - $K = \lvert \mathcal{V}\rvert$;
    - the weight matrices are identities;
    - the aggregator is an appropriate hash function with no non-linearity.
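A quick NumPy sketch of the two main aggregators above; the function names, the `W`/`W_pool`/`b` parameters and the choice of ReLU are my own placeholders, not the paper's code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mean_aggregator(h_self, h_neigh, W):
    # Mean variant from the equation above: average the node's own vector with its
    # sampled neighbours', project, apply the non-linearity; no explicit concatenation.
    m = np.mean(np.vstack([h_self[None, :], h_neigh]), axis=0)
    return relu(W @ m)

def max_pool_aggregator(h_neigh, W_pool, b):
    # Max-pooling: push every sampled neighbour through a small dense layer,
    # then take the element-wise maximum over neighbours.
    return np.max(relu(h_neigh @ W_pool.T + b), axis=0)
```

Note the difference: the mean variant produces $h_v^k$ directly, while the pooling one only produces a neighbourhood summary that still gets concatenated with $h_v^{k-1}$ and projected.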
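And a sketch of the unsupervised loss for one positive pair $(u, v)$ plus $Q$ negative samples, again with my own naming; the $Q\,\mathbb{E}[\cdot]$ term is approximated by a plain sum over the drawn negatives:

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def unsupervised_loss(z_u, z_v, z_neg):
    # z_u, z_v: embeddings of a node and a co-occurring (positive) node, shape [d]
    # z_neg:   embeddings of Q negative samples drawn from P_n, shape [Q, d]
    positive = -log_sigmoid(z_u @ z_v)
    negative = -np.sum(log_sigmoid(-(z_neg @ z_u)))  # Monte-Carlo estimate of Q * E[...]
    return positive + negative
```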
My traditional pseudosciencecode:
def forward(V, aggregators, W, depth, nonlinearity):
    # V: [N_nodes, Node_features]; W: one weight matrix per depth; aggregators: one per depth
    v_prev = V
    v = V.copy()
    n_nodes = V.shape[0]
    for k in range(depth):
        for i in range(n_nodes):
            # sample a fixed-size neighbourhood of node i and aggregate its features
            agg = aggregators[k](sample_neighbours(i, v_prev, k))
            # combine with the node's previous representation, project, apply the non-linearity
            v[i] = nonlinearity(W[k] @ concat([v_prev[i], agg]))
        # as in the paper, the L2 normalisation sits outside the inner (per-node) loop
        v_prev = l2norm(v)
    return v_prev
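To make the pseudocode above actually run, here are minimal versions of the helpers it assumes, followed by a toy run; the `adjacency` list, the `sample_sizes` budgets and the stand-in mean-of-neighbours aggregator are all made up for illustration:

```python
import numpy as np

adjacency = [[1, 2], [0, 2], [0, 1, 3], [2]]  # toy graph: node -> neighbour indices
sample_sizes = [2, 2]                         # neighbours sampled at depths 0 and 1

def sample_neighbours(i, v_prev, k):
    # uniformly sample a fixed number of neighbours of node i (with replacement only
    # when the node has fewer neighbours than the budget) and return their feature rows
    neigh = adjacency[i]
    idx = np.random.choice(neigh, size=sample_sizes[k], replace=len(neigh) < sample_sizes[k])
    return v_prev[idx]

def concat(parts):
    return np.concatenate(parts)

def l2norm(v, eps=1e-12):
    # row-wise L2 normalisation
    return v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)

# toy run: 4 nodes with 3-dim features, depth 2, mean-of-neighbours as a stand-in aggregator
V = np.random.randn(4, 3)
W = [np.random.randn(3, 6) for _ in range(2)]  # 6 = node dim + aggregated dim
aggregators = [lambda h: h.mean(axis=0)] * 2
embeddings = forward(V, aggregators, W, depth=2, nonlinearity=np.tanh)
```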
And?
- I like how clearly the paper is written: once the setting is established, it's very easy to grasp. However, the notation is not always uniform across the paper (e.g. using $u$ for neighbours in the mean aggregation and $u_i$ in the max-pooling one).
- The loss in Equation 1 got me very confused though:
    - Not sure why there's an expectation scaled by $Q$ when, in practice, it's just a sum over the $Q$ negative samples.
    - Also, there's a double negation for the negative samples (pun intended).
    - Also, it's not clear why there is a single positive node $v$ but a whole batch of negative samples.
- The authors prove that, for graphs in which every node has distinct features, one can approximate the clustering coefficient of a node to arbitrary precision. I wonder how practical this result is, especially since for featureless graphs one uses node degrees as features instead (→ different nodes will then have identical features).
- I wonder how subsampling the neighbours would work in tasks other than node-level representation learning. I believe it works well when the graphs are huge and we expect nodes to end up with representations similar to their neighbours', but in the RL setting, for instance, where each node emits an action, it could be harmful.